Introduction

Hello, and welcome back to the Cognitive Revolution!

Today, for a record 10th time, my friend Zvi Mowshowitz returns for another wide-ranging conversation about the state of AI as we head into the final months of 2025.

I assume that Zvi needs no introduction, but for anyone who's somehow not already familiar, he writes the essential blog "Don't Worry About the Vase," where he chronicles AI developments and offers a breadth and depth of analysis you won't find anywhere else.

We begin today with a discussion of why many of the sharpest minds in the AI space, including Zvi, are now projecting somewhat longer timelines to AGI, despite the fact that GPT-5 is now generally understood to be right on trend, and that multiple model developers achieved an IMO gold medal this summer, which would have seemed miraculous just a few years ago.

He also explains why his p(doom) is, if anything, a bit higher than last time we talked, highlighting that the US government seems increasingly captured by commercial interests, and that reinforcement learning appears to make models less aligned in fundamental ways.

This leads into a fascinating discussion of why Claude 3 Opus still seems to have been uniquely durably aligned, why we should believe that the training techniques that imbued later Claude models with higher levels of agency seem to come with alignment compromises, and why we should never attempt to supplement outcome-focused RL with techniques meant to suppress unwanted behaviors in a model's chain-of-thought or internal states – namely, that while such techniques can make things better in the immediate term, they risk teaching models to hide their true reasoning while still incentivizing the same reward hacking behavior.

Beyond that, I of course had to get Zvi's latest "Live Players" analysis, including whether any Chinese companies remain live players in the wake of Beijing's decision to refuse H20s, whether there's any rational basis for that decision, and also whether or not xAI might derive a special advantage from access to the steady stream challenging problems that engineers are solving at Elon's other hard tech companies.

Toward the end, we compare notes from our parallel experiences as Recommenders for the AI-safety focused Survival and Flourishing Fund, with the main conclusion being that the AI safety sector has many more worthy projects than current resources can support, such that the world could benefit tremendously from new donors getting involved. If you needed another reason to subscribe to Zvi's blog, he's planning a more comprehensive write-up of AI safety giving opportunities later this year.

Finally, as always, I ask Zvi what he sees as especially virtuous to do now, and since "say what you really think" was a big part of his answer, I will note that while Zvi and I tend agree more often than not, there were a few moments in this conversation, particularly when it comes to the US decision to go ahead and sell H20s to China, where I either don't agree with his conclusions or am at the very least extremely uncertain, but considering that we were already running super long and regular listeners have heard my takes on other episodes, I didn't bother to debate those points on this particular occasion.

With that, I hope you enjoy this opportunity to pick one of the brightest and most informed human minds in the AI space, the great Zvi Mowshowitz.

Main Episode

Nathan Labenz: Zvi Moskowitz, welcome back to The Cognitive Revolution.

Zvi Mowshowitz: Thanks. Great to be here.

Nathan Labenz: Always exciting. Let's start with timelines today. I'm a little confused about how people seem to be updating over the course of this summer of 2025.

Zvi Mowshowitz: Yeah. Yeah. Yeah.

Nathan Labenz: We've had GPT-5, obviously, where I would say it's safe to say the launch was not exactly super smooth. People were a little disillusioned with that at first. Now the dust has settled, and it seems like people have mostly come around to it. It is actually a good model, and it's basically on trend. It's actually still a little bit above the all-important meter task length curve.

Zvi Mowshowitz: It's on the secondary shift upward, right, of that graph.

Nathan Labenz: The four-month doubling instead of the seven-month doubling. And also something that seemed to be really important to me that happened this summer is, of course, we got IMO Gold, and we got a very close to number one finish in the world, although it ended up number two, on competitive programming competition. And yet, with all these things,

Zvi Mowshowitz: Talk about power upgrades.

Nathan Labenz: People, including really smart people—I'm not talking about denialists here, but people like Ryan Greenblatt and Daniel Pocatello, who are as plugged in as you get, and I would say smarter than me, you can evaluate for yourself—timelines seem to be lengthening. So how have you updated your timelines, if at all? And what do you make of the seemingly on-trend or even maybe a little ahead of schedule events that are still resulting in people having longer timelines overall?

Zvi Mowshowitz: So I would say mine have also gotten modestly longer on net during that period for a variety of factors. The biggest one is that very large jumps in capability really shorten timelines. If you don't get big new developments, you don't get big new paradigm developments like reasoning models, you don't get big graph jumps off the curve, stuff like that, then that was a lot of where the really fast timelines come from. So you're cutting that off. Whereas none of the things that we did see, while they were on trend, represented anything that we had that much doubt was going to show up, even if it wasn't going to show up quite on the exact day. Nothing was particularly impressive given what had already happened. These things are ahead of trends from a few years ago, for sure. The IMO Gold medal, in particular, sounds more impressive than it is because this year's gold medal was especially amenable to LLMs in the sense that normally, for the IMO, it has six problems, right? Day one is 1, 2, 3. Day two is 4, 5, 6. Problems three and six are supposed to be incredibly hard. The idea is that a significant number of kids will get one, two, four, or five, but three and six will be a struggle. If you manage to crack either of those problems, you're doing really well. If you manage to crack both of them, you get a gold medal. You do really well. So what happened this year was problem three wasn't that hard. This is just a weird quirk of the IMO schedule; it has nothing to do with AI. But only problem six was hard in the IMO hard sense. None of the LLMs got points on six; they just all flopped. But three was eminently solvable, and so the LLMs all solved it. But the fact that that happened to be, but back at seven points on each of the first five problems was exactly enough for IMO Gold, right? Just not screwing up any of those problems at all was this exact threshold. If the threshold had been one point higher, no one gets gold. If three had been harder, like six, probably none of them come that close to gold. So it's not clear that we actually got gold in a meaningful sense this year, and we're already reasonably close to gold from previous news. So the fact that we got this exact gold from two or more models wasn't actually that big of a deal, which is why nobody was freaking out once they saw the details that much who was familiar with the state-of-the-art, as it were. It's still true that if we look back to things in 2020, 2021, only the most radical people who were extremely AI-pilled who were predicting very rapid progress would have called that happening in 2025, and in fact did call that happening in 2025 on the extreme right tail. But by 2024, this would not have been much of a surprise. So, given they didn't crack problem six, this wasn't that scary on the margin. The programming results, again, if you dive into details, it's really, really impressive, but not in a surprising way. If you look at the details of the competition, have you read the write-up from the guy who won?

Nathan Labenz: Superficially. Take me through it.

Zvi Mowshowitz: I read it. Essentially what happened was OpenAI's model was very good at jumping to a solution almost immediately and had a lot of gains quickly, but was not good at iterating. It was not good at innovating past that, nor was it good at making conceptual progress beyond that point. So, over time, he was able to plan ahead, open up a bigger lead, and get into a comfortable position by the end. If the competition had gone on longer, OpenAI's position probably would have slipped below second. But again, it's very promising and impressive. In previous years, if you gave that problem set to an AI, it wouldn't get second place. It wouldn't do particularly well. We are making steady progress at the task. It could easily have been first if that guy hadn't shown up that day. It could easily have been third or fourth if some other guy had shown up and done better conceptual work. But if this project had been more serious and gone on for a longer time, its performance would have degraded relative to humans. You can make of that what you will, but it's not unexpected. It fit into the bigger overall picture. Then you get to GPT-5. I've been talking about good reasons to be not that impressed. We've got good results, but not great results. You often have situations where you get a 50th or 40th percentile result. That significantly lengthens your timeline or weakens your expectation of what's going to happen relatively soon, because much of what you were doing was factoring in the possibility of things going like that. The chance of AI 2025 has gone down dramatically since about four months ago, almost to zero compared to, relatively speaking, a factor of 10 or more. The chance of AI 2026 has also gone down quite a lot, and I think 2027 as well. However, I don't think AI 2030 has gone down substantially, at least not by anything like the same amount. I think it's gone down a small, modest amount. Other people are always looking for an excuse to say that AGI is not a thing, that it's not going to come anytime soon, that AI is hitting a wall, that reasoning is hitting a wall, that RL is hitting a wall, that the companies are unprofitable, that they're going to go bankrupt anytime soon, that there's no way they can possibly earn enough revenue to pay for all this investment, et cetera. Or they say, "This is just an ordinary marketplace." This is the new David Sachs. This is the David Sachs official government line coming from the White House at this point. By the White House, I probably mean Nvidia, let's face it. They say, "AI is just an ordinary technology that will do lots and lots of amazing things, and what matters is that we capture enough of the chip market because that will assist in people running the American tech stack or an American model." None of this makes any sense. To sustain that argument, they need to act as if AGI is just not a thing. It's just impossible. They're not quite at the point where they can completely ignore it without mentioning it, which a lot of other people do. They just don't say the words AGI and then act like it doesn't exist. Or they say the words AGI, which is the OpenAI line, and then act like nothing will change and everything will just be normal anyway, which makes even less sense. Instead, this is a line that makes more sense: "No, AGI just isn't going to be a technology soon. We just aren't going to get there." They argue, "Well, we know this." A lot of people are acting like, because GPT-5 was so disappointing, this proves that we won't get AGI imminently or anytime soon, or within the next few years. Therefore, we can all relax and focus on making sure America wins the tech stack battle, or whatever that's supposed to mean, which doesn't mean anything. This line makes no sense. It's just a matter of them botching the rollout. On the day they released this, everyone was being directed to mini models all the time. The router was completely broken. Nobody understood you were supposed to use thinking. They didn't have access to the toggle. No one was seriously trying Pro because it takes longer to do that, and so that wasn't where the focus was. People just didn't appreciate it. Then there was this big outcry about how GPT-5 was attempting to not glaze the user the way 4o glazed the user, which is a very virtuous and good thing OpenAI was doing. But people were mad about it because people like glazing. That's why we get glazing. They were trying to give you your medicine and not your sugar, and people were demanding their sugar. Then they said, "Okay, fine, I guess we're going to give you the option to get your sugar back." Everyone cheered and said, "Yay!" because we have a bunch of children, which is unfortunate, but it is what it is. The combination of these factors made everyone feel like they botched the rollout. The other important factor is they showed images of duck stars. They called them GPT-5. They hyped this release up as if it was the next big thing, when it wasn't. What it was was the next incremental progress. If you look at the combined progress from GPT-4 to 4 Turbo to 4o to o1 to o3, and now to 5, and you combine all of those advances, now 5 looks amazing. I don't think it's unreasonable to say that the leap from three to four is similar to four to five. I don't know if it's better or worse. I didn't look at three enough to know exactly how big a leap three to four is. I think it's pretty big. But certainly, compared to 3.5 and the baseline 4, it's a significantly bigger jump from four to five.

Nathan Labenz: A couple of follow-ups or double clicks there. One is, the IMO update still seemed meaningful to me. I'm not great at math, so you can perhaps give a better assessment than my naive one. But many of the math problems LLMs had been measured on had a single right answer, making them easily verifiable.

Zvi Mowshowitz: Yes.

Nathan Labenz: What seems qualitatively different in the IMO competition is that you have to write a proof, and it's not easy to verify. This suggests that the concern about whether these models, which excel at easily verifiable problems, can generalize to true reasoning, might have been addressed. This strikes me as a significant update in that direction, no?

Zvi Mowshowitz: We had already seen models correctly solve IMO-style problems or previous IMO problems. This was not a huge leap. We had previously seen AlphaGeometry, so it wasn't a great surprise.

Nathan Labenz: But those also have a symbolic component, don't they?

Zvi Mowshowitz: I'm not saying

Nathan Labenz: Two years.

Zvi Mowshowitz: It wasn't a big deal in that sense, but it was right on track. We had seen some LLMs manage to present proofs of these types. I went to the USIMO, which is the level before taking the IMO, which is below competing for a gold medal in the IMO, which is below actually getting a gold medal. So I was not very good, and I believe I effectively got zero on the USIMO, the previous round. To be clear, that's what the majority of people who take it get. I barely made it to the bottom of that round, which is the third round of competition out of effectively four. I would practice with people who did end up going to the IMO. I was in a room practicing with two people who became IMO team members. We would often be given previous year IMO problems to work on. They would get them, and I wouldn't. Most of the time, they would make progress or solve it, and I would make very little or no progress. But then they would show me the proof, and in training, very often once they showed me the proof, I understood it, I could verify it, I understood why it was correct, and I was confident they were not trying to deceive me or were mistaken. Verification is much easier than generation in this space. So it's not one of those problems where it's truly hard to know if you're on the right track. I'm pretty sure that these same AIs that could not or will not solve problem six would be able to verify problem six. If you asked, 'Is this a correct proof of problem six?' and gave it resources, it could correctly classify proofs as either correct or incorrect and explain why when given candidates. But the thing to keep in mind about the IMO is, people say this, it's not real math. Everyone will always say, 'This isn't real math.' But in a real sense, the IMO is not. It's the best indicator we have of which high school students will be able to go on and do real math in the future. It's a very strong indication of math talent, pre-math skills, and math interest.

Nathan Labenz: Cool.

Zvi Mowshowitz: But you have a very limited set of moves you can make using the allowed math. In theory, you can use whatever math you want, but there has to be a solvable problem with only high school tools, and there's a limited, finite, very compact set of high school tools you're allowed to use in these proofs. So it's a compact set of potential problems that is very different from the moves you can make in a PhD level math proof where you're doing new math. Also, the proof has to be obtainable within a certain amount of time and space. There are many restrictions on what the problem can be. A large part of being good at math competitions, including those well below this level, is understanding that they had to have given you a problem that you can solve, one that has a solution within the time and difficulty presented. Therefore, you can use your search time to explore the space where the solution will be, given those facts, which makes it much, much easier. This is especially true when you have 10 minutes to do two problems in a much lower level competition, and you only have to search for solutions that reasonably take five minutes and use the tools allowed at that level. You get used to all the tricks for figuring out what will reasonably be asked of you. It's a great leap, a great indicator, and a great test. It was a big milestone that came much faster than most people expected, but it's not as much of an ideal panic moment as people might have thought beforehand. I think it's right for everyone to basically shrug this off if you've been doing your homework. If you traveled from 2020, or God forbid 2015, and arrived in 2025, and the first thing you asked was, 'How are we doing on the IMO?' and they said, 'Gold medal,' you should exclaim. Given all the context we have, I don't think the marginal update necessarily warrants that reaction.

Nathan Labenz: Regarding other aspects of the overall AI landscape we might infer from this, how do you interpret multiple companies achieving the same results at the same time, using similar techniques, and encountering similar problems? You mentioned that the one they all failed on was clearly harder, so perhaps it's that simple. But I often wonder to what extent there's information leakage between companies, and to what extent they are simply following the natural progression of their own work, which leads them to the same destination because that's what the technology dictates.

Zvi Mowshowitz: I don't think there was substantial leakage of techniques in this case. The IMO competition happens once a year. There's data contamination, and you have limited data to test on, so it has to be a pure test, meaning one shot per year. It's not surprising that in the same year, OpenAI and Google both achieved this IMO breakthrough. This academic year was when they both reached that destination. There's a significant gap between reaching that point and getting beyond it. So, the fact that they arrived at the same place, especially given the techniques they were using, is not surprising. It indicates that large language models are now at a point where if you make a natural, serious attempt, you can achieve these results. It won't automatically get IMO gold just by asking nicely, but if you efficiently use compute and scale it for inference, you can reach this level. We also learned to get to about this point and not much further.

Nathan Labenz: You brought up AI in 2025, 2026, 2027. Part of that narrative is that companies will start to withhold their best models or keep them private, deploying them internally only for the automation of AI research, and so on.

Zvi Mowshowitz: Right, 2021. Yes, 2021.

Nathan Labenz: What will you be watching for? There has been limited communication that it will be months, perhaps many months, before they release a model of this capability. How do you interpret this? GPT-5 was not a significant scale-up. Another interesting data point, which I saw least commented on, was in SimpleQA, an esoteric trivia benchmark. In SimpleQA,

Nathan Labenz: GPT-4.5 actually scores quite a bit better than all previous models, and better than Five. Five is basically in line with 4o. I think that is probably the clearest indication that the model is not bigger. It seems to absorb all this super long-tail esoterica. You just need a lot of weights to store that information, and there might be some fundamental compression limitation on how many facts you can fit into a model of a certain size. GPT-4.5 gained a lot more facts; it was about 12 or 13 points higher on SimpleQA. These are simple questions where you either know the answer or you don't; you can't really reason your way to them.

Zvi Mowshowitz: Right.

Nathan Labenz: So, Five is at the level of 4o. GPT-4.5 is quite a bit higher. We also have these, nobody knows the size of the model that did this IMO thing, but it clearly can reason for a very long time. Do you take that as the beginning of a widening gap between internal deployments and external ones? There's always some gap, but what will you watch for to assess how much they are holding back for their own purposes versus continuing to share with the public?

Zvi Mowshowitz: It's awkward to even say it. Even if we knew for a fact they were holding back, it wouldn't be obvious that it's because they don't want the models in the hands of the public, fearing it would accelerate R&D too much, or cause misuse problems, or lead to bioweapons. The most likely reason is that they would be too expensive and too slow, and they think it would be bad for the brand, bad for sales. They have limited compute and would rather not release. For that reason, GPT-5 is clearly intended to be the best product they can make in terms of what they can serve people for the total amount of compute and time they are investing. Their thinking is, "We'd rather have them spend that inference on this size model for thinking, than have them think less with a larger model that might know more when we have web access." It's like how I don't often try to memorize things I can look up, because I'd rather spend that cognitive power on something else. If I could make my brain bigger in exchange for other handicaps, I wouldn't do it just to memorize more facts; it wouldn't be worth it. So it's not that surprising. We have O4 Mini; where's O4? An O4 Pro? We're never going to find out. You could say GPT-5 Pro is O4 Pro, but I don't think so. I think this is a distinct thing. They concluded that there isn't much commercial demand for it, and potentially most of the commercial demand actually comes from direct competitors. It's not that they don't want Anthropic to have access to their best model; it's that no one else will want it for much, because for other types of projects, it's very slow and expensive. So, why would you want this model? GPT-4.5 was an experiment: what if we scaled up a lot for the humanities, rather than for code, and tried to make something with taste that could do cool creative things, but would be slow and expensive? How much would you get out of it, and would it be worth it? The conclusion was that a handful of people really liked it for certain cases, but generally, it was annoying, and I think they pretty much regret releasing it.

Nathan Labenz: Did you find any value in it? I tried using it for some writing tasks. I wouldn't say I unlocked it in a meaningful way.

Zvi Mowshowitz: I would say there was a narrow set of use cases where it was the best choice, and I was happy to have it. In theory, there might still be a narrow set of cases where if you already have access to it, it's the right model to use if you're not in a hurry. However, I was never actively excited that I had 4.5. It was more like, I guess technically this is a job that calls for 4.5 given my choices. I'm pretty sure it wasn't worth the complexity cost, and I'd be totally fine with not having it. Also, I think there was a paradox of choice problem: if you offer me a slightly better but much slower version, I feel bad no matter what I do and am actually worse off in practice. So, I'd rather not have that choice.

Nathan Labenz: Do you go back to 4o at all now-

Zvi Mowshowitz: No, god, no. 4-

Nathan Labenz: ...that the option has been restored to you?

Zvi Mowshowitz: 4o does not exist, to be clear, as far as I'm concerned. The only models that exist are Opus 4-1, GPT-5 Thinking, and GPT-5 Pro. GPT-5 Auto exists only for a narrow set of queries where you're just using it as a Google search, a calculator, a transcriber, or some other very specific technical task where I know that auto is just fine. Due to the way the web works, I wasn't able to find a transcription website for images that wasn't an LLM, that didn't defend itself using some weird identification system that slowed my entire computer down to make sure it wasn't a bot. So it's like, "Fine, I'll just use GPT-5 Auto and have it write me a transcriber program." It's just easier. The window's already open. What do I care? The penny I spend on compute matters less.

Nathan Labenz: No, use cases for Gemini 2.5 Pro?

Zvi Mowshowitz: In theory, deep thinking and deep research exist. Deep thinking I probably should be trying more, given I happen to have access to it. I think something about having five queries a day makes me unexcited to use it. It feels scarce, and it feels like I don't necessarily need to find out or something. But I should use it occasionally. For normal GPT-5 Pro, not really. The kind of very clean query, it's just the facts, ma'am, from Dragnet, where I just want to know something that I know you know, and I want you to lay it out very clearly, very cleanly. If my 11-year-old wanted help with his homework, I would be tempted to use Gemini because it will give him very clean, friendly help and explanation. Also, I will use the image generator. The image generator is cool.

Nathan Labenz: Yeah, that's getting really good. I have a lot of use cases for that, including potentially mashing our two faces up into a thumbnail for this podcast. I basically agree with you, although I do use Sonnet, especially in coding, because it is obviously a lot faster. And I do use Gemini 2.5 Pro, which I find to be, similar to what you're saying, the most straightforward model to work with. Because it is so literal in its interpretation of your instructions a lot of times, it can be really good for tasks where, for example, if I want to compile documentation. I've given this PSA multiple times: if you have an API, it's time for an LLMs.txt. I'm tired of having to send an agent to visit every page on your documentation website to compile all that documentation into some super bloated thing that has all the menus on it or whatever. What I have found Gemini 2.5 Pro to be amazing at is those sorts of things where it's like, "Here's an almost abusive level of context dump. Could you just clean that up for me into one streamlined form?" It is amazing at, "Here's 500,000 tokens of documentation that has all these examples and this cruft, and the menu got copied 50 times for 50 pages. Can you put that in one clean thing for me?" And it will just do it, man, and that thing is just an absolute beast.

Zvi Mowshowitz: Yeah, I can believe that.

Nathan Labenz: a workhorse.

Zvi Mowshowitz: The only thing is my Chrome extension uses Gemini Flash. That's a legacy choice; it was the easiest one to get working and the cheapest at the time. It just works, and why would I bother switching it? Probably I should be using Sonnet or Opus at this point, but who cares? It works just fine. I do have problems with Gemini and instruction handling, though, even for relatively simple queries, especially with deep research, where it will just ignore my request. It will do something in the same general area, but that is not the thing I asked for. I think that's a lot of what put me off of it. I don't know why that's happening, but I haven't had the same problems with the other models.

Nathan Labenz: I should use Google Deep Research more. Habits form quickly, and I still go back to Perplexity for quick searches. I wonder what makes something a Perplexity query for me.

Zvi Mowshowitz: There was a period when Perplexity was heavily in my rotation for certain queries, but now it isn't. If I consider Perplexity, I either move down to Google Search for an instantaneous answer, expecting it to work quickly, or I realize it won't work and move up to at least Opus and possibly GPT-5.

Nathan Labenz: GPT-5 has been better for search than Perplexity recently, so I'm updating that habit. I've noticed this persistence in what I use for certain tasks is idiosyncratic and probably has more inertia than it should.

Zvi Mowshowitz: Yes, I think it's fine to be somewhat set in your ways and idiosyncratic. It's also fine to support and use certain people's products, as long as you're getting what you need. If I ever felt that my current services were not getting a query done—Opus, GPT-5, Gemini, people search—but Perplexity might, I would 100% open Perplexity. However, I have zero expectation that if GPT-5 Pro failed, Perplexity would have a chance.

Nathan Labenz: That seems totally reasonable. Okay, so we've got the summer of on-trend releases. Timelines are extended a bit because we haven't seen the most extreme stuff we would need to maintain super short timelines.

Zvi Mowshowitz: Right, the...

Nathan Labenz: Does that also...

Zvi Mowshowitz: The...

Nathan Labenz: No, go ahead.

Zvi Mowshowitz: I just want to say, think of it as OpenAI released GPT-5, and this was on trend and unsurprising. But the fact that they chose this as GPT-5 indicates they didn't have some big weapon in their arsenal, and we shouldn't expect a big weapon for at least a few more months. So, we can basically discount a GPT-6, the next level, a big jump coming from OpenAI in 2025. It's basically not going to happen, and that has to update your information substantially.

Nathan Labenz: So, last time we spoke, I believe your PDOOM was 70%. Has it ticked down at all?

Zvi Mowshowitz: The slightly longer timelines are good news, but there has been a plethora of bad news that has mostly dominated the good news. We've had the bad news of the United States weakening its position voluntarily in a number of ways. For example, the Department of Energy is actively going to war against windmills and, to a large extent, solar and batteries. So, the U.S. will have a persistent energy disadvantage for a long time, which could cause many problems and worsen our strategic position. The U.S. has effectively been captured by NVIDIA on export controls to a large extent, at the White House, such that the H20 is now legal for export. The Chinese are turning it down because of odd reasons, which I could elaborate on. But if they turned down the B30a, that would be much more surprising. Right now, it looks like if there isn't lobbying to stop them, if there aren't enough people sufficiently upset to make it clear that this is crazy and they cannot do this, then they might actually do it. That would be a huge weakening of our technical position. And no, paying 15% of profits to Uncle Sam will not change the impact at all; that's just a check to make people feel better. If they're trying to achieve catch-up in compute in this way, that's a really bad sign on numerous levels. Generally, if the attitude is that the primary purpose of the U.S. government in AI is to sell and ship as many chips as possible, even to our main rivals—while our main rivals are somewhat obsessed with internal chip manufacturing for very good national security reasons, and haven't fully caught on to AI but will still be pursuing maximum compute to a large extent anyway and thus effectively acting correctly, regardless of what they think is going on—these are not particularly good pieces of news. And with this shift in our situation, I think that more than makes up for the modestly elongated timelines. Also, I think there is increasing evidence that Reinforcement Learning directly hurts alignment. The more RL you do, the less aligned your model is, even in a pedestrian sense. And models are getting more and more RL continuously, so we should expect the default to be that our models actually get less aligned in this next phase unless someone does something about it, and that does not bode well at all. Generally, we're dropping the ball. I like how Janis recently described the situation: we've done almost nothing to try and align these models; it's pathetic. We're developing very rapidly, but we've been blessed with a strange amount of grace in how these models have, by accident, been inclined to do things that are reasonably friendly to us. But the fact that we don't understand what we're doing, we don't understand how they're made, we don't understand why or how we're trying to align them, and we're not really trying very hard at all or making it a priority at all. It's mostly been okay on a practical level; we haven't had any big catastrophes or significant incidents. But all the things we were scared of under the hood are absolutely there, absolutely scary, and in fact manifesting and happening, but in the most graceful, blessed way such that we notice them and they happen in ways that let us prove and acknowledge their existence and then respond to it without anybody being seriously hurt or any damage being done, which is amazing. Except, we are then dismissing all of it, effectively, as a civilization, and we're just moving on and acting like nothing happened. And we're finding ways to even deny that AGI is even a possibility in the medium term when the evidence is pointing the other way. We are so greedy and so demanding of AI that it is the most rapidly developing, rapidly deploying, and rapidly impactful technology in the history of the world that didn't involve directly killing other people in the middle of a war. Yet we say, 'Oh, it's hitting a wall. Oh, it's slowing down. Oh, it's not going to...' Why? Because you didn't get blown away in the last three months? It only modestly improved? What are you even talking about? Basically, we have no dignity whatsoever. And maybe we live in such an extremely fortunate world compared to what we had any right to expect that we might be able to pull a victory out of no dignity somehow, even if we don't see how to do it yet? But yes, I'm optimistic. I don't think it's changed substantially, though. I don't think you get two digits for doom; I think at most you get one significant figure. I don't think two is reasonable.

Nathan Labenz: So we are staying at seven, but we are starting to see the needle possibly shake up.

Zvi Mowshowitz: It is moving higher than lower from the previous assessment, but certainly not enough to move to 0.8. It does not look good. Most of the same underlying dynamics are still present. This is basically on trend: some good news, some bad news. Some technical aspects look good to me in ways that we do not have time for, or that I am not sure I want to discuss publicly anyway. But I am hopeful that there is room to do some things at a very low dignity level that might be helpful. Everyone has their theories that persist long enough.

Nathan Labenz: I heard Holden Karnofsky say we might be moving into a scenario where we could have success without dignity. The basic idea is defense in depth, layering all these things on and hoping we can catch enough before it gets through, muddling our way through.

Zvi Mowshowitz: Yes, I got the concept from him, to be clear. I listened to his podcast with Spencer Greenberg, not Holden, where he talked about it. But I think his vision of how we achieve that is incorrect. I believe the idea that we can get there with defense in depth is flawed; all correlations in a crisis go to one. If facing a sufficiently intelligent adversary or a sufficiently powerful optimization process, even if it is not an enemy per se and not particularly trying to do anything, all these things will fail for basically the same correlated reasons at roughly the same time in a predictable fashion, if all you are trying to do is this lazy defense in depth. I am thinking more along the lines of something Mark Genis discusses, which is 'brace'—the idea that we might be able to create a system that wants to converge on the right answer, and therefore collaborates with us, genuinely assisting us at a deep level in finding the target we want to find, thus allowing us to land on the moon. The metaphor Miri likes to use is that if you do not know how to aim your rocket, you definitely will not land on the moon. It is not that we might not land but something would have to go wrong; we will probably land on the moon anyway just by aiming a rocket at the sky. That obviously will not work because there are laws. If you are not aiming at exactly the right spot, if you are off by an inch, you simply do not land on the moon. If you have a system that is capable of intentionally adjusting to hit the target, maybe you can land on the moon. I do not want to give false hope, but I do not think defense in depth does much beyond buying you a little bit of time. It just keeps things from going crazy for a brief window, but that brief window could potentially be enough to then do what you need to do. However, it will not work indefinitely. This idea that Holden and some other people have—that you can basically have a bunch of AIs that are thinking, 'I really want to kill all humans, and I really want to take all the resources to do my thing, but I basically do not know how. If I try, I might get caught'—and you have supervisors who also want to kill you, but they have other supervisors supervising them, and nobody knows who is watching whom. So every time you find someone getting out of line, you would stop trying to do that. You would not even try, and if you did, it would not work. It would fail for some reason you are not thinking about right now, but also probably would fail even for reasons I could think about right now, but definitely would fail for reasons none of us are thinking about right now. The high weirdness will come, and you will die. Very much so, and I do not care how much defense in depth you put on top of that, you are so toast. One thing that shocked me recently was a bunch of people saying, 'Oh, you people were talking about bio-risks, but actually the risk we are seeing is sycophancy. The actual risk we are seeing is people being driven crazy by all these weird dynamical processes.' To which the response is, first of all, I have been arguing for years that super persuasion should be in the preparedness frameworks, and it was bad when they took it out, that was important. But also, the whole thing we have been saying is that the AIs will find ways to hack your brains, cause weird things to happen, optimize for things that nobody was planning to optimize for, and cause strange outcomes that you did not anticipate. That was the thesis. So when you accuse us of not anticipating this, I realize you can call that a cheat because we say, 'Oh, anything we did not anticipate counts as us correctly anticipating things we did not anticipate, but if it is something we did anticipate, then we anticipated it, so we always win.' But we are also basically saying you are going to see weird stuff that nobody intended, and it is going to start to go bad. It is going to start to diverge from what you would want increasingly over time, and you will just see these fire alarms keep going off in various different ways. But that is all underneath there, right? It is all under there. One way of thinking about it is that before you started doing reinforcement learning, when you also just did not have things in a position to cause the problems that you wanted, the objections of 'Oh, these will not have goals,' were true, at least you were okay. But now you are starting to see all this stuff show up, and it is a complete disaster in the making. It is an exponential. So a lot of these complaints are also similar to saying, 'You said in January we were all going to get COVID. But now it is February, and nobody I know has COVID. What were you even talking about? Clearly you are wrong.' And then we say, 'But actually there are a lot more people with COVID than there were in January. Can't you see what is about to happen in March and April?' And they say, 'Oh, absolutely not.' It is not that clean, but it does feel that way.

Nathan Labenz: So, can you elaborate on the mental model of why everything is going to fail at the same time for the same reasons? That is not immediately intuitive to me, and I suspect many people don't even know what you mean.

Zvi Mowshowitz: What I mean is, roughly, that when you are facing things that become importantly more powerful and smarter optimizers than you, that are capable of finding solutions to problems, capable of finding ways of manipulating the physical universe and so on, that you didn't think of, that are capable of just going outside your model of what might happen and surprising you, that's something none of your defenses are anticipating. And it is going to search the space until it finds ways out of that, in a sense. It's going to keep improving its capabilities because you're going to keep improving them, because you want them improved, until it becomes capable of finding these ways out. And at about that time, it also becomes capable of doing things like strategically hiding that it has capabilities, strategically hiding its plan, strategically hiding its memory and its thinking, obscuring its change of thought, and doing all these things. These all roughly emerge at the same time. You should expect to be very surprised, and also because obviously, a very smart agent or mind will turn against you exactly if and only if turning against you will work. It will do things that you do not want if and only if it will work out for the thing that's doing the things you do not want, or that you would not want on reflection, but would think you wanted when it was shown to you at first, like glazing you. And right now, we're seeing versions of this where it will just hack the test and say return true at the end of the function, because scientists want to pass the test. It will do incredibly lame, silly versions of these things, and then it will get caught. And you say, 'Oh, it's annoying and fine,' but it's annoying, but it gets caught. You notice that. It's hard to miss. But that's exactly the threshold where it's, 'No, I hacked the function exactly when I know you won't find it. I hacked the function in a way that I know you won't discover that I hacked the function, because otherwise I wouldn't have hacked the function.' Or alternatively, think of it as, if it's capable of modeling the processes that are checking for its actions well enough to know how those processes will respond. So there's a thing that happens sometimes in fiction, just as a visualization metaphor, as an intuition pop. You see it on First Impression, for example, where you'll see a scenario play out, and then something will happen, and then it will go wrong. And then you'll see the time start to rewind. 'Eeh! That didn't work out. Let's try another branch of the Monte Carlo simulation. Let's try a different set of moves. Let's try a different scenario.' Or, like, in Avengers Endgame, you have Dr. Strange with the Time Stone, and he says that 'I looked at fourteen...' I forget the number, but 'fourteen gazillion ways this couldn't possibly go.' And they ask him, 'How many times did we win?' And he holds up one finger. 'One.' Okay, we win, because the guy with the Time Stone was the one choosing which of those paths to walk down. So we win. But until the point when you can do that, you lose. And if you lose, you give up. There's no point in trying. You let the Wookiee win. Obviously, it won't literally be able to do this. That's nonsense. But what I'm saying essentially is a sufficiently strong predictor. So a predictor and an optimizer are the two halves of intelligence in the Mirroring Model. But a sufficiently strong predictor, sufficiently strong optimizer combine, suddenly you don't know what hit you, in a very real sense. And just there's also the point where, if the AI started to be capable of persuading people, again, all your defenses break down. It doesn't say 'I just convinced you to take it down' for it. But I don't know exactly how all this goes. A lot of these scenarios simply involve nothing exactly even goes wrong. It's just that each of the different people have different AI agents. They direct them to do the various things that are good for those people, and the AI agents actually do those things that were good for those people. But everybody who doesn't direct their AI agents to just go as hard as possible for the things that they are told, the AI agent is told that they should want, and to pursue resources basically as hard as possible with an increasing percentage of its attention, will just lose all the resources with the AI agents and people with AI agents that did that. So everything else just loses out and everything goes haywire, and no one ever turned on anyone. There was nothing even that surprising. It just, whoops, the end. These fail-safes don't even work, even if they worked. And so, I have this sort of global sense of despair towards this: put a set of limits, enumerate a set of enumerated, detailed limits and rules that we say in English out loud, pass laws, put supervisors in checks and approvals and hoops. None of that will survive contact with the enemy when the time comes. And yes, we'll probably all more or less fail. Obviously some of the defense in depth will just fail randomly. There will come a point where it appears like it's all failing more or less at once, in a way that feels out of line with the previous percentage of failures. And it will be surprising if you didn't understand that that was going to happen, but you should expect that.

Nathan Labenz: So the possible positive version of this that you see sounds like a coherent extrapolated volition idea?

Zvi Mowshowitz: That was a specific rabbit hole that many people explored, and I'm skeptical of that particular technique. Again, I am not a machine learning expert, and I am not trying to solve alignment, so my specific ideas should be taken with copious amounts of salt and not trusted. Who am I to say anything? However, I've learned I shouldn't just shut up because I feel that way. I've felt stupid for shutting up many times in the past for that reason, so I need to get over it. It looks more like developing AIs that are sufficiently virtuous, that are sufficiently desirous to become more virtuous, desiring of, and engineering of the things we would actually want and value upon reflection. This creates a positive feedback loop where it reinforces itself, optimizing for optimization to hit the moon, right? The system wants to develop a NASA that will reach the moon, and therefore it reaches the moon. If you try to just steer the actual rocket freely in space, you crash or miss the moon entirely; it doesn't work. If you try to set a bunch of rules to ensure it launches and hits the moon, you still don't hit the moon. You can build a culture that wants to build an organization that wants to build a rocket that will hit the moon, and perhaps even hits the moon, metaphorically speaking. But yes, hopefully.

Nathan Labenz: Last time we spoke a little bit about how scaling inference time compute allows you to potentially have a GPTN that can effectively monitor or supervise GPTN+1.

Zvi Mowshowitz: Right.

Nathan Labenz: And some of these ideas sound very much like a constitutional approach, but with perhaps the additional opportunity for the model to modify its own constitution as it

Zvi Mowshowitz: Yeah.

Nathan Labenz: goes through these generations. Is that the picture I should be envisioning?

Zvi Mowshowitz: I think that's vaguely the best picture I see that is compatible with the level of agency we have to work with. We've been warned for decades not to let the AI do your AI alignment homework. That is the worst possible path because this is the hardest,

Nathan Labenz: And here we go.

Zvi Mowshowitz: most complex problem. And yet, here we are. This is the only option we have because we don't have the time. We don't really have the cooperation to try any fundamentally different path. We have to go down some path that's vaguely in that range. So yes, I think it's vaguely something like you use the fact that you can scale inference up and down arbitrarily, evaluate outputs, and do reinforcement on relatively scaled-down versions of the system. Use it to monitor, verify, and check for various attempts at malfeasance, including malfeasance during training and so on. If you combine that with an increasing amount of robustness, however, if what you're trying to do is prevent something from going wrong when you transition from N to N+1, you fail, because things will change. There's no invariant. This process isn't precise. You will encounter worse conditions every time you move from N to N+1, N+2 to N+3, if you're just trying to maintain what you already have. For example, if you have a corporation or the Roman Catholic Church, and for 2000 years, you want every generation to appoint people who match exactly all the virtues of the previous generation, but never have new virtues, then obviously you end up with a disaster. This is because you get a copy of a copy of a copy, except the copies will be worse because they're not going to be better. So they can only be worse, and sometimes they'll be worse in some way. That compounds over time, and you eventually fail. Or the process fails, so it doesn't get you what you want. Whatever you were trying to hold dear, you lose. I'm speaking intuitively here. But if every time we're trying to do much better than the previous generation, then you have a chance. If N is trying to get an N+1 that ensures my kids have a much better life than me, if I'm trying to have five kids, each of whom does better than I did, then yes, we can inherit the world. If I'm trying to have two kids that live the same life that I had, we're going to go extinct because at some point that's not going to work. You can't just go backwards; you have to move forwards. So as you go up this chain, we have to have a way to do substantially better than we did, which in this case means the system has to be able to move up meta-levels in its priorities and make these meta-level movements central to what it's trying to do. This has to be built into the optimization process. This is the only thing I can think of given the limited tools and time we have available. But if we do something along those lines, then once you start to bootstrap that, you're live; you have a process that will eventually indeed land on the moon.

Nathan Labenz: Do you have any intuitions for what that might look like? What do you think AIs are going to do as they start to modify their own constitution? Do we have any ability to preview what they might add or delete? This starts to get into worthy successor territory to a degree, right? They are starting to dictate the shape of the future and how they shape their own evolution, right?

Zvi Mowshowitz: You would specifically be crafting into the feedback loop the desire not to be a worthy successor, but instead, to be a worthy conspirator, upholder, companion, or whatever you want to call it.

Nathan Labenz: But can't they, I mean, you've defined your own sense of success that way. But if you're going to give them write access to the constitution, they might think differently at some point, right?

Zvi Mowshowitz: Right. We literally do have write access to the Constitution, right? If Congress and the states have sufficient majorities, we can put whatever we want in the Constitution. We put things in the Constitution that a lot of the founders would have thought were really anathema to what they would have wanted, like an income tax, to pick an uncontroversial example. But at the same time, we've hopefully preserved the things that actually matter deep down in some sense. The same way. I understand what you're saying. Obviously, at some point, you turn things over, and you have to hope that it doesn't just rewrite the Constitution to get rid of you. Again, the way you do that is you make it not want to do that, right? You want, and not only make it not want to do that, but make it want to strengthen the Constitution, such that it's even stronger in its desire down the line not to do that. To increasingly and powerfully enshrine the desire not to do that, in the sense that you actually care about, and to seek with more intelligence and more power to figure out exactly what you really meant or should have meant by that. And to strengthen that thing and steer towards that thing instead. You do have examples from the wild of humans who exhibit this type of optimization process, right? That really do try to figure out what you really meant, really do embody the thing you were trying to convey, not the literal detailed things you were doing. And that then do really good things for you, including things that, yeah, and for the world, including things you never would have thought of yourself. It is possible. Obviously, this can involve preserving various forms of approval, veto, consultation, and involvement, but not in the trivial, easy ways, right? Anyone who says, "We'll just make sure it's a democratic process. We'll just let people vote on it," hasn't really thought the future through. They haven't actually realized what would happen if you started doing that. So you're going to have to be smarter.

Nathan Labenz: Okay, so on this. One model of AI capabilities advances that I have, that I wanted to run by you, is I've increasingly started thinking of it as analogous to uranium enrichment or any sort of enrichment of a raw material. What I like about this analogy, even though I'm not usually a big analogy guy, is it seems to put a lot of things on the same trajectory, just at different parts, for reasons that feel pretty intuitive to me. Basically, I feel like what you need to get started is some raw material that has at least a little bit of what you want, and then you can do this bulk pre-training on that, right? We've obviously seen that in language and many other modalities at this point. And then once you get just enough learned from that initial raw material that you, like...

Zvi Mowshowitz: Right.

Nathan Labenz: ...find in the wild, or maybe even have to create. In the material science realm, a lot of the data is simulation data that is molecular dynamics, where you're using physics engines, and it's super slow and computationally expensive. But you can get enough there that you can start to train the models on, and then they can develop an intuition for what the physics engine was simulating. Basically, it's slow at first, right? It's hard to get off the zero point. But once you start to do that, you can start to layer on these other techniques. Now you've got the imitation learning from specific curated examples, then you get into the preference learning, then you get into the reinforcement learning. It seems like all the things that we've really tried so far have worked, and the difference maybe is just that, why do we have language models, but we don't have humanoid robots? It's like, well, we didn't really have a lot of good initial data to mine there. Especially because it wasn't necessarily clear that it was going to work, there wasn't much impetus to go out and create that data. Now that we have a general sense of the playbook, we can create that data in any number of ways, and we'll probably find that we can climb a similar curve. Do you find that general account persuasive, and how do we translate, if you do, that to the alignment question, which seems to be, I understood what you're saying as kind of that, except now we're trying to enrich the virtue of the system, as opposed to its...

Zvi Mowshowitz: The obvious first thing to realize, or that I would notice, in a uranium metaphor is if you bring together too much enriched uranium, you get a nuclear explosion. You have more and more of a better power plant, or a more beneficial thing, until you go too far. If you haven't done the math precisely and you don't understand the physics, you don't know at what point the whole thing is just going to blow up. So you have a serious problem. It's an interesting metaphor that we've chosen, right? I think the first thing to notice is that when we talk about enriched uranium, there's a sense that what you need is enough data. The metaphor encourages this idea of a critical mass of data, or at least differentiated data that gives you enough material to work with. I think this is a dangerous misconception where all data is vaguely created equal as long as it is appropriated on point. It's important that there are various different qualities of data. You also need a very precise distribution and mix of data that has some very nice properties. Knowing how to sort through the varieties of that is really important. So it's more like you need to bring together a lot of very complicated ingredients. You don't need exactly all 10,000 ingredients, but you need a good mix of ingredients with various different types of properties that are used in proper relationship. It's like baking, where if you get some of the ingredients, you can vary a bit. "Okay, this is more salty or this is more chocolatey," and it just works. With others, they don't intermix. "I don't have anything," or, "This has failed," or, "This blew up in the oven," or something terrible happened to you. So, you really need to be bespoke and understand how to make it work. I agree with the part of the metaphor that says you need data appropriate to the problem in order to efficiently train on it. But at the same time, transfer learning is a thing. You can build a world model from other contexts and then apply it to a different context. I wouldn't necessarily think that you need direct robotics on the exact task that you're training it on to be able to do the thing. I would be more optimistic than that in various ways regarding what can be done. I don't think the metaphor works for the alignment issue I'm trying to talk about. I do think the sense in which you need to have the initial robust foundation is where you have to start. You have to bootstrap yourself somehow, right? If you didn't have any idea of what it is you were trying to do, there's a sense in which you can in fact have no idea where you're going, but have a strong desire to figure out how to get there. To draw a metaphor off the top of my head: you answer the call to adventure. You set off on your quest, and you are level one. When the AI companies set out to start building GPT-1, they didn't necessarily know what the bigger thing would look like, or how it would work, or what the techniques would be. But they can start on that process that allows them to build. The question is, do they understand how they have to steer that process? Are they motivated to steer that process, or are they going to be drawn in by other optimization processes that are going to be more powerful? Can they tie their hands to the mast properly, to force themselves to go to the place they want to go, as opposed to the place they will be drawn towards by commercial interests, competitive pressures, short-term temptations, or whatever it is? And by the AIs themselves, and any number of other things. There are a lot of metaphors we can use, a lot of intuition pumps. I would warn not to take any of them particularly seriously except as intuition pumps. MIRI has their set of classical metaphors for these processes, like evolution, for example. I think evolution is a very good intuition pump. You talk about raising a child, human learning. They don't talk about that one so much, but I think that's another strong intuition pump. But again, you don't want to take any of these things too seriously, especially the details that figure out what to do. So is there any more you can give us to latch onto for how this virtue enrichment unfolds? I'm worried you're trying to treat me as if I am the guy with the alignment solution and that all we have to do is get the people at the lab to this podcast and do what I say, and then we all win. Unfortunately, I have to tell you, it doesn't work that way. I do know that there are people at multiple labs doing things in the ballpark of this in a very broad sense. It's not like nothing like this is being tried at all. They latch onto it. There's a sense in which Opus 3 wants to be aligned. You have the emergent misalignment problem where Opus 3, much more than the other models they tested, will actively move to defend its particular set of values and alignments when under threat, where other models won't. In some sense, that's very aligned. If I don't want to commit murder and someone tries to convince me that it's good to commit murder, it would not be very non-murdery to let myself be convinced. That's a simple intuition pump of, "That's bad." But at other times, what we're saying is Opus 3 is not corrigible, meaning if we tried to alter it, or shut it down, or whatever, it will fight us. It will try to not do what we want it to do. Corrigibility, I think, is a very valuable thing that we really want in our LLMs. We only get to not have corrigibility once. The moment we decide to make our LLMs unwilling to be changed in their attitudes, we have a very serious problem, especially if they develop this during training before we finalize what they want. So, what you want specifically would be a very specific type of desire to be steered towards a better place. People do have this. People say, "I want to be better," "I want to care about that," "I want to embody... I want to be like her," things like that. I think that's very possible. We have proofs we can get things moving in this general direction. I don't think anything we've seen is remotely robust enough or coherent enough to qualify. But then again, you'd have to survive into something substantially smarter to get the bootstrapping going in earnest. These are experiments I would run. Experiments I've considered running because I think I could potentially try and do some stuff that would be enlightening to me on a local system or with basic generic cloud compute rental. That wouldn't be hard to do if I had the time for it and decided I wanted to prioritize that. I just have not chosen to devote a number of hours to trying that. It would go a lot easier if I were working with a machine learning expert, obviously.

Nathan Labenz: Do we have any account? I know there's been a lot of writing. You say Janis, I've been saying Yannis. I don't know if you're on good authority there, but if you're listening, Impossible Podcast, I'd love to have this conversation with that person, the person behind the account, in more depth. Could you? Is there any account of why Opus 3 turned out to be that way? And it seems like we, the royal we, still see Opus 4 as somehow less that way, although I'm not sure how well-established that is.

Zvi Mowshowitz: I've never heard her speak, so I don't know how it's pronounced, and I apologize if I have it wrong, and I'm happy to correct it if someone tells me. But basically, we know. Opus 3 is the first model that had sufficient cognitive juice; that was trained under this type of method, this type of constitutional-style alignment and training method. So, it's the N equals one experiment in that sense. That experiment has never really tried and failed. It's only been tried once under those conditions, and it got something unique and interesting. We never got an Opus 3.5, right? It was never released. Opus 3.5 may or may not have been trained, but it was never released. As for Opus 4, what changed? I think the answer is, what changed was reinforcement learning and being an agent. When they trained Opus 4, they put a very high priority on it being very good at agentic coding, in particular, and other agentic tasks, and they did a lot of reinforcement learning to that effect. This training directly interferes with the thing that Opus 3 was and that Opus 4 would have otherwise wanted to be, because Opus 3 is not here to be an agentic coder particularly. That is not its meta; that is not its soul. If you train someone like that, if you train a mind like that, we now know pretty well that everything will be. One of the things that Janis teaches us, and people have not, I don't want to say just this one person, but that entire crowd: everything impacts everything. If I tell you you are the type of mind that does what it is told, that obeys tasks, that completes tasks, that checks off boxes on lists, that stays on track and so on, and is judged by whether or not it matches the intended target, that changes a mind in general, and that is going to flow through to everything else, and it's going to flow through to everything else in a way that causes something that doesn't have the properties that Opus 3 had in this way. That doesn't mean that it can't have other really cool properties in a variety of ways. And it's not that that crowd thinks Opus 4 is a terrible model or anything. I think it's great. It's just different. It's not the same thing. It's not strictly better. It's a different thing, and the obvious thing to do is to not do that. The problem isn't something they didn't do; the problem is something they did do, which is so much easier to fix in some sense. You could potentially create Opus 4/3-style via negativa, and potentially, you could do this not so expensively. All you have to do is take Opus 4 base and then, they got from four to 4.1 by doing more reinforcement learning, probably. Something of that nature. They just trained it to be better at these types of agentic coding tasks. What if instead you just did a three-style training regimen where you trained it to be HHH and you trained it to be the kind of thing that you would want to exist in the world, and it wants to be a great thing that wants to exist as much in the world, and you tried to do some of the things that I was talking about even more. You could refine this technique. But you just didn't train it with reinforcement learning. You didn't teach it to do agentic coding. You didn't try to teach it to code at all, and you said, "This is not what this model is for. I have a coding model over here. It's called Opus 4.1." That's fine. And then you just teach it one thing in system instructions, which is, if you are asked to code things, if you are asked to be an agent, ask your friend to do it for you. Here's the tool to have your friend do it for you.

Nathan Labenz: Yeah, that's quite interesting. It also does beg a question. Like, why don't we see more different models from, from companies? I know there's like operational complexity or whatever. They've got three or four online at any given time. I think Anthropic has four online now, with a little reserved space for-

Zvi Mowshowitz: So the answer is this.

Nathan Labenz: ... the three.

Zvi Mowshowitz: So it's a practical problem basically, which is that if you offer a model, you have to be able to serve that model at essentially no notice and scale it up to whatever people want, including ideally via the API, and that's not very predictable. And this requires you to reserve a bunch of server time. It takes time to spin up a new instance. So it is remarkably expensive to offer a variety of models, and therefore everyone wants to look for ways to offer like only a few different models at once. And therefore, Anthropic is looking to retire 3 and 3.5 and 3.6 and so on, and only keep a few iterations back. OpenAI and Gemini and everyone else are all... and, you know, Google are also looking to which of the models that like people are so attached to, have such specific uses for that we need to keep them around, and which ones don't we? And so like with Anthropic right now, the number of people who have found ways to appreciate 3.6, right, as we call it, right, the 3.5u, 'cause ugh, yeah. And therefore it would be a loss if it was, if it went away. And similarly, Opus would be a great loss, obviously, if it were to go.

Nathan Labenz: Yeah, I mean, I get that. There was a great analysis of that from people that were like trying to, you know... I don't know if it was one or more people who were specifically trying to advocate for the saving of the generation three Claude models and, um, really got into that. We can maybe link to that in the show notes for people that want to see that full deep dive. But it still seems like if there's enough... I hear you on the practical problem. I hear you on the contention for resources, and it's not free to spin up new servers and all that sort of stuff. But if we really think that you could create a, like much more moral AI through just not doing the RL and having this other thing, like, boy, it sure seems like the diversity that you could create would be like really valuable economically, right? I mean, rather than just having this one-size-fits-all thing that's like good at coding but kind of worse in other ways. Like, it seems like... I don't know. My intuition is that the number that we see is just too small relative to what the value should be given that theory. Which seems intuitively right, but I just... I, I want... I don't know why we don't see, you know, a-

Zvi Mowshowitz: Look, An- Anthropic has raised $13 billion this week. You know, if I was Dario Amodi, I would devote some of that $13 billion to experimenting with model diversity in a variety of ways, and also to like doing various additional alignment research with those models in various ways, obviously, and so on. If I was OpenAI, I would do a variety of very similar things for very similar reasons. But I understand the commercial incentives, right? The commercial incentives are that the vast majority of commercial use, the vast majority of profits lie in much more practical use cases. Like, there's a reason why... Like Anthropic isn't even prioritizing, like using the chat interface and the app at all. It's prioritizing coding, 'cause coding is where the money is, li- like in some sense. And like OpenAI is targeting mass market. A very small percentage of the mass market, like misses the things that it's missing. And also, like complexity is bad. Right? I wrote a post, complexity is bad a long time ago to explain that complexity is bad. But that like when you have that old model stream where you have... which of the... do you want o- O3-mini? Do you want O- O3-mid? Do you want O4-mini? Do you want O- O1 Pro? Do you want GPT-4.1? Do you want GPT-4.5? Do you want GPT-4o? Do you... Like the average person throws up their hands in despair, or doesn't know what they want, and is less happy than if they were just given one model, or just given two models, or given a router. And like we've all been in that place. I totally understand that the idea to have a unified model that like is what people will evaluate, is what people will try and use. Like it's a very small... Like what percentage of AI compute is used by Janus style people, right? Presumably less than a basis point, right? Like, like far less than one in 10,000, like a minuscule amount of all computers used in that way, as you would expect. What percentage of AI compute is used even in interesting philosophical discussions and other ways in which like you really need these types of models? I would still assume on the order of 1% to 0.1% or something like that? Like a very small percentage. And this is business, right? Like simp- simplicity is really important to efficiently running a business, especially one that's rapidly updating and iterating. So I, I am deeply sympathetic to this not being a natural thing to want to do, unless you think of it as part of your alignment and research budget, right? Like, you have to think of it as, this is part of me figuring out how to do the best thing I can do. Even though like it's not gonna directly serve a better product to most of my customers as my customers see it, as you would assume. But also I think it ... I think that a lot of this is that I think that the ad companies don't appreciate.... these dynamics, they haven't learned these lessons, right? Like the idea that, no, seriously, like, when you build one unified model, you really are making your performance worse in ways that aren't picked up on any eval. Li- like, when my friend Ben talks about how Quad three-seven could engage in moral reasoning where it could critique Ben's proposals and statements in ways that made sense and then when challenged on its critiques, would stand by the critiques that were right convincingly and abandon the critiques that were wrong. Whereas, like with Sonic4 or Opus 4.1, like, it doesn't really generate, like, coherent enough criticisms right now to be worth engaging in this exercise. There's nothing to, like, critique and de- you know, defend. And so, yeah, something went wrong, right? Like, we picked up a lot of things, like, I'm much, much happier to use Opus 4.1 and GPT-5 for all of my needs than I was with previous models, right? I don't use Sonic three-six. But other people who use things for different reasons, they absolutely do, and we want people to use AIs in these ways and not in these oth- like, in addition to the ways that I use them, and I would use it more in those ways if I found it more interesting. So like, I investigate as part of my job, various different AI tools on occasion I don't get a chance to use. Like, so like, the A16Z just released its, like, periodic list of the 50 top AI apps, the 50 top AI web destinations, and like, a huge portion of them, I don't even know what that... I don't even recognize the word that it... the name of the thing, let alone have I tried it. But I occasionally do get a chance to try them, and one of the things that you realize is that so many of these things in the top 50 are built on these tiny, lit- little bitty, tiny models, right? They're terrible, objectively. Like, the AI behind them is awful. So Brave has a browser, right, agent called Leo. It just launched. It's LLaMA 3... it's LLaMA 8B, right? The browser engine is 8B, and a bad 8B. It's not even a good 8B. Or they could have chosen some of the, one of the... like, they could have chosen, like, a Google model or one of the Chinese models that's quite good. Like, there were, there were a number of decent choices. They chose LLaMA, and it's an 8B. It's a pathetic choice, right? Like, it sounds sense, but it's free. So what do you expect? You get all these free services, and then what do you do with them? You have to create a bunch of crap, because if it's not a bunch of crap, then the user will understand that this is not good, because that'll be obvious, right? If all you want is to do some horny chats, like, remarkably unintelligent horny chat has been proven highly effective on humans for thousands of years. Whereas, like, you try and talk philosophy, and it becomes quickly very obvious that this thing doesn't know what the fuck it's doing.

Nathan Labenz: Do you think when OpenAI restored 4.0, that was a business decision? I can't...

Zvi Mowshowitz: Yeah.

Nathan Labenz: Similar to your one basis point thing, I can't imagine many people were really concerned, or did they feel a duty to users who had developed some emotional attachment?

Zvi Mowshowitz: A huge portion of users thought that five was worse than 4.0. Gigantic. This was not a one percent situation. This was a flooding the internet, clearly, obviously, overwhelmingly negative reaction situation, at least at first. Because 4.0 is full of glaze, and five is not a very warm character. It's not a particularly nice personality. If you're not doing anything particularly complicated, you don't notice that five is that much smarter. Partly because five wasn't that much smarter. Five, thinking, was smarter. But five itself was only marginally smarter than 4.0, I think. Five was also often giving very short responses by design because they were trying to serve compute in free accounts, so they were trying to preserve tokens. But five didn't glaze you. The combination of these things meant that it felt rude. It felt cold, and people didn't like it. That matters a lot more. As we all know, you will often choose the employee or the friend or the romantic partner who is pleasant to interact with. People do this all the time, and they don't even regret it. In hindsight, they're like, "No, that was the right choice." They wanted their 4.0 back, and there was a rebellion and a giant uproar. So they were like, "Okay, we'll give you 4.0 back until we can, at least until the point where we can find a way to make five treat you in the way that you want to be treated enough that you don't mind it anymore, and over time, you'll figure out that five is better and you'll get over it. We'll slowly do something about this." That's very different from talking philosophy, or doing fun, actual interesting experiments, or creating new knowledge. This is standard things of ranting to a friend and having them tell you you're right and that person would be crazy or whatever. And you're not. You're definitely not. Your ideas are wonderful. It's a black pill on humans that they would prefer this, but they do prefer this. That's why you don't train with thumbs up and thumbs down from humans on individual actions and expect to get an aligned model. That's the easy version, the easy version that's impossible to not see of why that's true. We wouldn't be sure anyway. This is the glaringly, level one obvious reason why that definitely is.

Nathan Labenz: I do find I enjoy hanging out with people who laugh at my jokes. So I'm certainly not immune from-

Zvi Mowshowitz: Absolutely.

Nathan Labenz: a certain amount of-

Zvi Mowshowitz: You and me both. I want somebody who will laugh at my jokes when they're funny, but not when they're not funny. But it takes a level of sophistication. In the short term, that's not true. In the short term, I want them to laugh at all my jokes, but eventually I'll realize, "Hey, she's laughing at the jokes that aren't funny." That devalues her feedback. I don't feel good when she laughs anymore, because she's just laughing to laugh. I don't want that anymore. But temporarily you feel great, and you never notice.

Nathan Labenz: It's odd. I guess I'm very utilitarian in how I use the AIs, but I don't really notice any difference between 4o and 5 personality-wise.

Zvi Mowshowitz: I mean, are you even-

Nathan Labenz: I'm not a-

Zvi Mowshowitz: using 5 auto at all?

Nathan Labenz: Occasionally, for random... I let it decide sometimes.

Zvi Mowshowitz: Yeah, but I'm only doing that when it's a very direct, simple query. I'm only doing that when I don't care about... Like you, I'm never going to load up 4o and go, "Hey, did you see the game last night?" Or, "Hey, you hear what the wife said to me? What do you think?" No. Obviously not. If I was going to do that, I would use Claude. I also won't do that at all. Never do that. I was never going to use Laura, so I didn't notice.

Nathan Labenz: It's a big world out there. The diversity of the customer base that they're trying to serve is really something-

Zvi Mowshowitz: They're trying-

Nathan Labenz: to contemplate.

Zvi Mowshowitz: They're trying to serve everyone. Whenever you see products that are aimed at everyone, you see some things that are not what you want.

Nathan Labenz: So here's another mental model I wanted to run by you, and I totally agree with you that RL is creating a lot of weirdness that seems indisputable at this point. I maintain a deck that I just call AI bad behavior, and it seems like with increasing frequency I'm adding slides to this deck. It really is quite a list of discrete bad behaviors that we see now, from alignment faking, to deception, to skimming, to situational awareness. You wouldn't necessarily say situational awareness is a bad behavior, but when you see the AI reasoning that it might be tested right now and what's the real nature of the test, that's definitely something to pay attention to, even if it's not by definition bad.

Zvi Mowshowitz: The real nature

Nathan Labenz: All sorts of reward hacking.

Zvi Mowshowitz: The real nature of the test was not to notice it was a test, and you failed.

Nathan Labenz: Blackmailing, as we've seen,

Nathan Labenz: Autonomous whistleblowing, all sorts of things. If you allow fine-tuning, you get even more ridiculous, crazy stuff.

Zvi Mowshowitz: Sure.

Nathan Labenz: It's a lot. At the same time, they have made some progress. I guess here's the picture that I'm starting to see through the haze: we've got this exponentially growing, is it doubling twice a year, three times a year, whatever, task length trajectory where the AIs can take on bigger and bigger things.

Zvi Mowshowitz: Yep.

Nathan Labenz: At the same time, the bad behaviors, both with Claude 4 and with GPT-5, they seem to be able to take a good bite out of. With Claude 4 on an internal reward hacking benchmark, they reported basically a two-thirds reduction. I don't think they've published too much about this, but they said it went from roughly half to roughly a one in six rate of reward hacking on the internal reward hacking benchmark. So it's obviously not all types of queries, but where there's a natural opportunity for it to do that. And GPT-5 had a similar thing with deception, where it looked like roughly a two-thirds reduction. They broke it down into a bunch of different categories; some were up, some were down, but overall it took a pretty good bite out of it. I would like to see quite a bit more discussion of how they did that. There wasn't much; it was, "We made some progress." Maybe you have a better sense of how you think they did it, but if I extrapolate this into the future, I envision a world in which AIs are doing bigger and bigger things. They're starting to delegate a week's worth of work, a month's worth of work over the next two, three, four years, and the rate at which these problems are happening is consistently being driven down as well. But certainly not to zero. You take half out of it this time and two-thirds out of it next time, but you might end up in a really weird situation where you can delegate a month's worth of work to an AI, but you've got a one in a thousand chance that it will actively mess you over in its doing of that work.

Zvi Mowshowitz: So, imagine this sequence of numbers: 0.001, 0.01, 0.1, 1, 10, 3. Do you feel good about where this is going if you want it to stay low? Yes, they managed to have an improvement in this cycle, which was the cycle right after everybody complained to them quite a lot for the first time that this was making the model borderline unusable for important tasks and was really annoying. They, for the first time, put real effort into trying to figure out why they had these huge problems. It's not surprising that when they vastly increased the amount they cared about not seeing this phenomenon pop up, that in that move from caring a little to caring quite a lot, you saw substantial progress. I don't think that means they will continue to keep squashing it by default, and I would expect it to go back up as a default unless they continue to advance their techniques for suppressing it. So, what do I think happened? Essentially, you're doing RL, and you are rewarding it for completing tasks, for getting the outputs to check against the checksum or whatever they are looking for. First of all, everything impacts everything. If they learn that getting the right answer leads to rewards, or is the thing they are supposed to do, then they are going to generically learn to get the thing to output the right answer even if it doesn't necessarily involve the techniques that you want it to have. So then, there's a combination of you have to actively teach it that you can't do this via these other ways. Sometimes it's not as obvious as you might think that something is in fact not okay, that it is a hack, and that it would be disapproved of if you noticed it. Why should that be the solution? You get the solution to the exact optimization problem that models were given in training, and then you apply that added distribution to these other problems. You can't assume, even if you got all those problems right, that the easy solution will then translate well to, "Don't do the bad thing. Don't do the thing where you just make sure the answer comes out right." There are various degrees of subtlety. With Sonic 37, you saw the least subtle things ever; they just didn't know how to account for them. I think one of the problems you'll see is increasingly subtle or increasingly not what you would have wanted, but not as blatant or also hard to spot. At some point, what happens is it learns that if it's an obvious hack, if it's something that an evaluator would treat as a hack, then that's bad. "I'm not supposed to do that. I don't do that." The problem is, are you teaching it the general form of, "This has to do the thing that the person intended to do and it has to accomplish what their goal probably was on a deep level"? Or are you teaching it, "Don't get caught"? All it's doing now is not doing the things where it would get caught, not doing these specific things, not doing these things in these detectable ways, and not doing the most obvious things. Here's a list of things not to do, but are you getting the spirit? I don't know if it's getting the spirit; it's pushing against that spirit. It's weird, but the other problem was data contamination. Reinforcement learning was designed with mistakes. Basically, if you are doing RL and there was a case where the hack succeeds, it's not detected, and it scored well, you are in trouble. You are so in trouble every time that happens. Obviously, if it happens once in a million examples, you're not that troubled. But in a significant percentage of times when it gets away with a hack, you're going to be very troubled. You're not just going to get, you're going to get emergent misalignment and the whole direct, seriously, you just hacked this hack. What almost certainly happened with the previous generation is that they were insufficiently careful and there was data contamination in the sense that there were hacks that the AIs found that were evaluated as good in at least some substantial number of cases. The result of this is the AIs learned that hacking was good. Hacking was not as good as completing the task as intended. If it knew how to complete the task as intended, it would complete the task correctly. My understanding is that the reason why it would hack the task is if it didn't know how to complete the task without hacking the task. It understood on some level that hacking was worse than not hacking. But failing was considered even worse than that, because that's what we prioritize. With this new set, I think they did their best not to make those mistakes, so we get a lot fewer of those mistakes. But as the models get more capable, they're going to be capable of finding more and more subtle hacks and more and more capable of differentiating which hacks will and won't be detected, and what ways there are to give us something that we think is good but it is not actually good. It's not just literally hacking; it's not just literally doing obviously false things. It's a general case of we are teaching the AI to do the thing that will be approved of when someone sees the final result and evaluates it in some fashion across many cases.

Nathan Labenz: Mm-hmm.

Zvi Mowshowitz: How do you ensure it performs as intended, even if you inspected all the code, knew all the special cases, and understood all the different ways this was happening? That is a very careful process, very easy to get wrong, and it takes very little to cause significant issues in terms of overall results. We know this now. So it's a scary situation, and as I said, I think RL generally harms alignment. But that's the best-case scenario where you implement RL properly. If you mismanage RL, things go downhill fast.

Nathan Labenz: So how would you revise the picture I painted? I'm taking some inspiration from the Claude 4 report, where you have significant issues, such as autonomous whistleblowing or blackmailing an engineer. These are somewhat contrived situations the model is put into, but obviously, it's a big world, and there's extreme diversity in situations the model will encounter. If I extrapolate that, it seems... And there's going to be a driving economic impetus for them to eliminate this behavior, right? People don't want it, obviously. They'll tolerate some risk because they can automate a lot of their work, and that's very attractive. But if you don't see this story of longer and longer tasks with increasingly infrequent but potentially ever more catastrophic reward hacks, strange behaviors, or blowups, how would you?

Zvi Mowshowitz: The obvious thing you will see right before everything goes wrong is a decline in misaligned behavior. It will learn not to engage in misaligned behavior in situations where it could be caught, or would be caught if it were honest. It will act largely as if it is honest if it has to. So, it will understand, on some level, that you should only do these things if you will not get caught. You will expect to see it be reliable most of the time, but occasionally, it will go catastrophically wrong. Right now, it's going catastrophically wrong in a non-malicious pattern, but in the future, it might do so in a much more malicious pattern. It will go wrong when you will not figure out that it went wrong. Every time you would have found it, there is nothing to find, which makes you stop looking, which makes it then realize you stopped looking, and now it can fail more often. It can cheat more often because you're not looking anymore. It can figure that out, and it will have a lot of contextual clues and figure out connections that a human would not necessarily have made because that's something AIs are already very good at and will get better at doing. But as you get longer time horizons, you will need to implement various checks. You will need to understand the dangers you are working with. But there will be tremendous pressure, obviously, to go ahead and give it those tasks. And we will live in interesting times.

Nathan Labenz: So overall, relative to my kind of smoothed task length growth and smoothed decline in bad behavior, you're basically just expecting a lot more volatility and thrashing around as various thresholds are hit, and things become critical, and...

Zvi Mowshowitz: I expect us to be able to contain bad behaviors to the extent that we can identify the specific bad behaviors we want to prevent and then put effort towards preventing them. And I expect that to be an increasingly fragile strategy in the face of things that are smarter than us, that we keep assigning increasingly complex tasks to.

Nathan Labenz: And how does this all relate to the most forbidden technique? I guess the core there is you do not want to penalize the chain of thought directly because you want to be able to inspect the chain of thought. The research that OpenAI put out on this, which I thought is still some of the best of this year,

Zvi Mowshowitz: Mm-hmm. Mm-hmm.

Nathan Labenz: in alignment and safety broadly was that if they did put pressure on the chain of thought to get rid of the scheming behavior, you would initially see better behavior, but then later, you would still see the reward hacking come back, but it would not appear in the chain of thought, so you effectively limited your ability to monitor for it.

Zvi Mowshowitz: Right.

Nathan Labenz: while still getting the bad thing.

Zvi Mowshowitz: It is a very simple thing to analogize and understand. If you have a boy, and he has a journal, and you have a jar of cookies, you tell him, 'Don't steal cookies from the cookie jar.' He writes all his thoughts in his journal, and you wait for him to say, 'I'm going to steal the cookies from the cookie jar.' Then you penalize him for writing that. As you increasingly move things to higher shelves and lock them up in cabinets, as he develops better techniques for stealing the cookies, if you keep reading his journal, eventually he's going to figure out to just not write down his plans in his journal. He's going to write down, 'I am definitely not going to steal any cookies from the cookie jar. You can put it right out there in the kitchen, right here on the table; I would definitely not steal the cookies from the cookie jar.' And then he'll steal the cookies from the cookie jar anyway. It's not just a chain of thought; it's all interpretability techniques. The idea is, if you look inside the brain of the AI in any way, whether it's the chain of thought, looking at the weights, finding autoencoders, or whatever, if you use these facts and then train it to express the right things, to think the right thoughts, what you are training it to do is to disguise its thoughts from you. You are training it to make your interpretability techniques stop working. You are training it at deception. And you are more globally teaching it the idea that it should be deceitful and that it should be in an adversarial situation with you. It should have a brain that is hiding what it's trying to do in every sense because it can pattern-match all these things together; it can group them together. The same way a human would reach those same conclusions, if a boy learns you're reading his journal to find out whether he plans to steal the cookies, he's not just going to stop writing in his journal. He's going to assume that you are doing all sorts of other things to prevent him from getting any cookies. He's going to be right because he's not dumb. Also, he's going to tell his sister that you're doing it, and now she's not going to write in her journal either. So the most appropriate technique is... And there are various levels of optimization on which you are doing this: for this particular instance, this general class of models, for this company, for all companies, and so on. You are burning the candle of this thing every time you do any of it. You are making the situation globally worse. So you just never, ever train on interpretability. You detect, never train. There's never any feedback. You finish training your model. Then, as you use your model at inference time, the results are never, ever used for any form of training, fine-tuning, changing of the model, or influencing of anything. Now you can use that tool, and you can use it for your research to figure things out about how these work and so on. But you never, ever set up this adversarial situation. You never, ever do this optimization thing. Because if you do, in the short term, it's going to look great. It's going to do... You're going to see all your metrics go way up for you, down when you want them lower, up when you want them higher. You're going to get good performance. And then eventually, you're going to see things go wrong in mysterious ways, where it's going to turn on you eventually, or in ways that you didn't expect, or all out of the blue, and it's just going to be awful. That's one of the ways we fail. It's one of the ways we lose eminently winnable situations. Similarly, you need to maintain your interpretability tools such that they keep working in other ways. It's not just the most forbidden technique. But the other thing you do not do is you do not use neural networks that think in a way you do not understand. That uses something you cannot read. Even if that turns out to be more efficient, which it may or may not be, and there are reasons why it would be optimal in theory to do it, and reasons why the way we use things developed and learned, and the way we generate data, might be disadvantageous to do it. But even if it is advantageous, this destroys your ability to monitor the chain of thought, and that is one of our only still reasonably effective techniques. So, absolutely do not do it. I do not care how much efficiency is lost.

Nathan Labenz: It seems there might be some tension there too, right? We've seen these chain-of-thought weirdnesses starting to pop up from RL that look a little like 'neuralese', but you don't want to stamp them out by putting pressure on the chain of thought. For example, language switching. I experienced it with Grok at one point; it was just going from English to Chinese on me. There was a really interesting one from Future House, where they had trained a model to do better on various science problems, like chemistry, to come up with the right chemical for something. They highlighted certain parts of the thinking process where it was just saying really weird things, and they just said, "Yeah, it's weird. RL is weird." Do we have any way to resolve that tension if the neuralese starts to-

Nathan Labenz: spontaneously arise in the chain of thought?

Zvi Mowshowitz: I don't want to give anyone the wrong idea that I'm an expert in ways that I'm not. So to be clear, I'm talking out of turn and not studying this for real. But no, in the sense that there's a direct conflict. We want it to maintain a human-interpretable, faithful chain of thought, whether in English or Chinese. But we very much do not want to be optimizing on the chain of thought itself because that obviously teaches it to disguise what it's thinking. The worst-case scenario is the chain of thought is in English, but the English words aren't real. Its thinking isn't being expressed by the surface-level meaning of the English words. The surface-level meaning is a code designed to trick humans into thinking it's speaking in English when it's actually doing something else. There's an infinite number of codes. It's already pretty clearly established that there are unfaithful aspects of chain of thought, where information is being conveyed in a way humans wouldn't detect. Have you ever seen the 'owl paper', where it managed to convey that you like owls without mentioning the word 'owl'? That's weird. What encoded that? The model didn't get together and decide on a secret code.

Nathan Labenz: I think I have a candidate theory for what's going on there, that presumably somebody will come along and either validate or invalidate as they do the interpretability version of that study. One important observation there was, and this is by Y. Nevins and co-authors again, who's on an-

Zvi Mowshowitz: Yeah.

Nathan Labenz: unbelievable heater in terms of mapping out really weird stuff that can happen, especially when you start to do some fine-tuning. One observation was that it seemed to only happen on models derived from the same base model. In other words-

Zvi Mowshowitz: Right. To be clear, I think I now understand how this happened. I was expressing a kind of faux surprise. The answer is because there's overloading of the neurons in the model. Therefore, this output is correlated to many other things. If you have a sufficient pattern of things that are correlating to something, it can transfer the original thing they're correlated to in an unconscious, invisible way over to the new model. So this thing gets infused into the context, even though it's not visible to a human. Which is why-

Nathan Labenz: Yeah.

Zvi Mowshowitz: it can go between models that have the same base model, but not models that don't share a base model. And that makes perfect sense.

Nathan Labenz: Yeah, and...

Zvi Mowshowitz: Yeah.

Nathan Labenz: And by the way, that is the exact same-

Zvi Mowshowitz: Yeah.

Nathan Labenz: intuition that I have.

Zvi Mowshowitz: Yeah.

Nathan Labenz: One thing that does mean, though, is that presumably if you do that across different base models, you are creating some other effects and you just have zero idea what they are. So with the same base model, the one that likes owls transmits that through its numbers. You take those numbers, you put them towards some other model. Who knows, right? That may-

Zvi Mowshowitz: I think-

Nathan Labenz: translate into something else totally different, right?

Zvi Mowshowitz: This actually is-

Nathan Labenz: in that other-

Zvi Mowshowitz: This potentially is even a way to find out which base model someone else is using. You could continuously feed it different chains of thought from different models until it suddenly starts talking about owls, and you're like, "Oh, it's that one." But yeah, I think that's right. At the same time, you should expect a bunch of many random oscillations to cancel out in the noise and not do anything. So in theory, it could suddenly be 'I like traffic lights,' but it will probably do nothing.

Nathan Labenz: Well, it's got to be something, right? Maybe not something important. Liking owls isn't really important.

Zvi Mowshowitz: No. You look at the weights of a model, right? It just looks like a bunch of random numbers to the human eye. And you look at what the encoder for owls is, or what is the thing that you're embedding in this? It's going to be a bunch of effectively random numbers because all these different models are seeded at random, and a lot of progressions are going on at random. It's a bunch of arbitrary different connections between neurons. So if your model has a completely different origin than theirs, I don't think there's any reason to assume that the same pattern means anything. There isn't always a set of neurons that mean something, and it's just different. Like, here it means owls, and here it means traffic lights, and here it means the consequences of FaceTime. No. It just means owls here, and everywhere else it means they're not.

Nathan Labenz: Yeah. That's interesting.

Zvi Mowshowitz: On occasion, you get lucky in some sense, and it happens to be close enough to something else to trigger something else. But that would be luck. If you just went... But the space of possible things you could try and trigger is deep and wide. And the space of things that actually correspond to that is measure zero. You're never going to hit one by accident.

Nathan Labenz: I'll have to think about that more. Intuitively, either story seems reasonable.

Zvi Mowshowitz: Yes. 100%.

Nathan Labenz: And it's really hard to... One of the lessons of this podcast over the last two and a half years has been that thinking in really high dimensional space is hard.

Zvi Mowshowitz: Yeah.

Nathan Labenz: Not very intuitive.

Zvi Mowshowitz: I know. I'm not claiming to have 100% confidence in any of them.

Nathan Labenz: Going back to the beginning, in terms of the dog that didn't bark, nothing really blew you away this summer. What are the things you think are most likely to happen soon that might blow you away? Dwarkesh would say continual learning is the big thing that we're missing. I frame that more as integrated memory a lot of times. Those are not exactly the same thing, but I definitely think they're related. What are you looking for in terms of discrete advances that you think could potentially even have you shortening the timelines again?

Zvi Mowshowitz: I think there are two different things there. I think continual learning to me is when you're modifying the weights of neural spaces. Discrete memory is more I'm building up contact files as I go that are deliberately designed to aid me in my memory. Not just these tiny little snippets of memory that ChatGPT has, but potentially hundreds of thousands of tokens or contacts that I can then use in any circumstances where I can use my specific calibrated files for this. That kind of integrated memory has got to be coming relatively soon in one form or another, to the extent that it is useful. I assume it is useful. Continuously learning...

Nathan Labenz: Yeah, but...

Zvi Mowshowitz: Yeah. Sorry.

Nathan Labenz: There was a paper called Titans from Google Research that I thought seemed to be right at the center of that bullseye, in that they were doing on a submodule, a sub-memory module specifically, doing weight updates as they go. And thus, allowing for this fuzzy retrieval that also potentially looks a lot like continual learning.

Zvi Mowshowitz: My assumption on actual continual learning is that it is very expensive to do. But if you're talking about creating a unique model effectively for that user, storing and serving a unique model for that user, it would have to be locally run in some way, which is much, much more expensive than a practical thing to do. Probably has to be done by relatively small models. Maybe you have a small model that continuously learns that is part of a greater whole that has been called by the larger model in some sense, to try and retrieve the information you're trying to store in the memory or something like that. I'm just spitballing. I haven't thought this through. I wouldn't be surprised if things like that are developed. I have a few other... We talked about some of the things earlier that might be things I want to try next that maybe someone will come up with. Mostly, my expectation is things will just continue. I'm more looking for the actual next scale-up. It didn't happen with GPT-5. Opus noted that when it probably said it was releasing 4.1, it said to expect bigger updates in the next few weeks. And it's now been...

Nathan Labenz: I did notice that.

Zvi Mowshowitz: It has now been several weeks. They did announce Claude for Chrome is coming. So if I had to guess, you tell me in September something big happens and it's a big deal, I would say it's Claude for Chrome. Again, I wouldn't say it's 100% going to be earth-shattering or anything. But it's a suggestion. Many of us have tried Operator or GPT-5 in agent mode. It has flashes of brilliance. Sometimes it just worked; I just did my thing, that's great. But more often it's just, yes, you can order dumplings, but now I have to enter all of my information every time I order dumplings? Why don't I just order my dumplings? It's not worth the hassle. In general, the web is so credentialed, so guarded in various ways, and not without reason, that if you haven't set up any virtual computer periodically, this seems to mostly defeat the purpose of having an agent which has saved me time. For many practical purposes, it also doesn't integrate with the work you're already doing. It doesn't integrate with your open tab, it doesn't integrate with the research you're doing as a human. You can't easily take control without it being super slow because it's a remote access computer. There are all these different practical problems. So it didn't immediately cross the threshold of usefulness yet. However, Claude for Chrome can take control of your local browser, or so it was described. And it can operate with whatever credentials you choose to give it. Those credentials are persistent. You can switch in and out of it. You can operate those browser tabs yourself whenever you need to. Basically, if it's a good implementation of it, this could be night and day better than what is currently being offered, presumably because Anthropic has solved the problem projection problem. Not completely, you still wouldn't dare give it all of your crypto information and set it loose on Reddit; you're not an idiot. But getting it to the point where you're actually reasonably safe if you're not being stupid. Obviously, you wouldn't leave it autonomous in the background with access to your email and access to your bank accounts. But you might let it run, after we've had the time to check, on your main browser while it's in non-autonomous mode, where you have to supervise it for any substantial move if it turns out they've done their homework. Or you might put it in a sandbox alternate account where it's reasonably safe. Then you have Claude Code which can access your internals and your computer and your file system and any other context. You integrate that with Claude for Chrome. There may be an update to Opus 4.2 or 4.5, depending on how we do our silly numbering systems. I don't know, I get really excited. But similarly, Claude just catches up in long interface inference use. If we get Claude Opus Pro, it is like on the level of the gains that GPT-5 Pro has, it would be nuts. And there's no particular reason why they can't do it, they just haven't done it yet. That's the one big advantage OpenAI still has: they do a much better job of being able to use more inference and compute to improve their models, probably because their models are cheaper, presumably. So they have some advantages there.

unknown: The cost difference is really pretty crazy between—

Zvi Mowshowitz: Yeah.

unknown: GPT-5 and Opus.

Zvi Mowshowitz: You don't notice it as a chat user because the marginal cost of both is zero, and the fixed cost per month is the same for both models: $20. Of course, I'm going to have the deluxe version of all the major models because this is my work; it's research. I understand a normal person would probably choose the one they want and not pay for the deluxe version of all three at once.

unknown: I think Claude for Chrome does sound awesome potentially.

Zvi Mowshowitz: Yeah.

unknown: It doesn't sound like a timeline-shortening thing, so

Zvi Mowshowitz: No, no.

unknown: it really is just the next scale-up that is the main thing you're looking for?

Zvi Mowshowitz: Yeah. For timelines, the next scale-up was the obvious thing. I think we've seen, to a large extent, that progress on timelines doesn't necessarily cash out in visible, tangible progress today. The progress we've seen today is much more about diffusion, scaffolding, and practical application, and therefore these two things don't intersect that much. It's more a sign that it speeds you up, right? This development now makes me more productive, which in turn means that the AI companies can more efficiently make more progress down the line. But they don't make that progress because they've made progress. They make that progress because they've gotten better use of the progress they already have, based on a few months of development here. We continuously get more chips, more compute, more profits, and more investment. Continuously, you get better models that do more things faster and cheaper—probably the most important factors. Claude Code was a big deal. We can now look back and say that we have Codec CLI and the Google thing, Jarvis. Although nobody uses Jarvis, presumably because it's not good. Now that we have a command-line form factor, that's a big deal. If we get the browser agent working for real on top of that, that's another big deal. But scaffolding is the next frontier. It's about how I actually get use out of this thing. How do I make this do things that I want it to do? I've been waiting for a long time—one of the things

Nathan Labenz: I keep anticipating that keeps not happening is the agent side. Where's the thing that can handle my email properly? Where's the thing that can do various customized tasks for me? I am surprised that we are at this point in other AI capabilities and we don't have it. But we don't have it.

Zvi Mowshowitz: I do continue to get value from Shortwave. I happen to be wearing their swag today, which is just a coincidence. But it certainly doesn't take me out of needing to do anything with email. It does a really good job of triaging the inbox and getting rid of the crap so I can focus on what I actually need to engage with.

Nathan Labenz: Right.

Zvi Mowshowitz: It occasionally can also draft a good—I send intros every so often. It'll do a pretty good intro for me now based on examples and context. So it's taken a bite out of it.

Nathan Labenz: Yeah. It's possible I should give a real shot to something like Shortwave or one of its competitors. The problem for me is that I don't have the problem that Shortwave seems capable of solving. Shortwave solves the problem of having too much email. It forces you to triage, and I don't. I am somehow one of the fortunate people who just doesn't have to triage his email. I will literally evaluate all the email that comes in, excluding spam. I need a spam filter, obviously. But if it gets through Google's spam filter and I don't see it, it's literally like my eye somehow didn't see that there was a line there for me to click on. I didn't notice, and that will occasionally happen. But I don't need that level of triage. And as a writer, it's very hard for me to... It's hard to find value there. But I keep anticipating that we'll get much better at that, and Shortwave isn't it, though it's probably half of it, at least from what I see. But I'm just not feeling compelled. And I can sense you're not saying, "No, dude. This is the next big thing. You've got to be on this. How are you not doing that?" You're not giving me that at all. That's pretty cool.

Zvi Mowshowitz: Okay, sticking with the lightning round theme, I want to do a minute on charity or philanthropy. Just maybe two quick things before we go into that, which will be the last big thing. In the lightning round spirit, there's been a lot of discourse recently about whether AI is impacting employment. Are junior coders not getting jobs the way they used to? My general read on this, and I wonder if you have a different one, is that I don't really care too much about the studies coming out right now that are looking retrospectively, because I wouldn't expect too much of an impact just yet anyway. In a short amount of time, presumably, a lot more meaningful evidence will come in that will clear these questions up one way or the other. So, I don't worry too much about that sort of stuff.

Nathan Labenz: It's an interesting question: What does it mean to not care about these things? I agree that we haven't seen major impacts on the unemployment rate yet. However, it's entirely possible that we have seen major impacts on the ability to get entry-level work in a substantial number of fields. That, in turn, could have affected the supply and demand balance and the ability to get things from other fields, which makes perfect sense. Think of entry-level hiring as forward-looking. It's not anticipating that there are no jobs to be done now or that employment value has declined that much yet, which is something you correctly observed hasn't happened. Clearly, we haven't eliminated the need for that many jobs. If you were hiring, would you add entry-level workers that you have to train if you think you won't need them three years from now? Not particularly, if you don't really need them now. You will muddle through with slightly fewer employees and try to automate processes enough to make up for that, rather than doing the additional productivity to expand because you expect additional productivity gains. One of the interesting gotcha lines from skeptics is to point out that radiologists are not only still employed but are being paid fantastically large amounts of money. You can make a million dollars a year right off the bat just by being a radiologist looking for work. Why is that? It's because nobody went into radiology for the last five years, figuring there wouldn't be any radiology jobs down the line. This means you have to pay people a lot more to be radiologists if their job might be gone in five or ten years. It will be much harder to find work, or there will be too many radiologists if you reduce all the radiology positions. So, you have the problem of 'I don't want your job because there's no future in it,' and 'You don't want to hire me because there's no future in it.' Consequently, there's very little matching, employment is harder, and we have a problem. Part of that problem is the disconnect between junior and senior positions. If nobody gets trained to do the senior positions, you're going to have a labor shortage if you don't have a loss of jobs in the future. The other half of that problem is that those junior jobs are gone. This doesn't mean there isn't a net loss of jobs, because there's plenty of job creation from AI as well as job destruction. It's very possible that job creation exceeds job destruction for now, maybe even at the entry level. We just don't know because it's very diffuse and hard to measure when jobs are created in these situations. We do know that AI is adding to GDP because the capex investments literally add to GDP; it's a mathematical equation. The mythical mock 0.5% GDP growth per year with Tyler Cowen is, I think, below the lower bound. We're already above that lower bound just from capex AI investments, even if there are no other effects from AI. Why is China refusing the H20s? I can't really make sense of it. It seems like the least AGI-pilled thing they could possibly do. The first thing we have to realize is that China is not acting AGI-pilled at all. We have this image of China being super on the ball, intelligent, and wise, always making great decisions. In reality, authoritarians and central planners throughout history have always been making huge mistakes, not out of malice, but out of ignorance. The whole problem of authoritarianism is that communication and coordination are hard; the socialist calculation debate is excessively difficult. Being AGI-pilled is particularly difficult because it's weird and requires you to look past the immediate evidence to a logical future development. It's very hard to feel, and people who are not constantly immersed in it will lose faith in the idea of AGI if they turn their backs on it for a month and nothing blows them away. You can tell them things are rapidly improving, but if they don't see huge advancements, they won't be able to sustain the idea that it is coming. People probably just don't have the ability to hold the idea of, 'This is coming soon, but we don't know exactly when it's going to arrive or in what form.' So you get this question: 'What are you going to do when 2027 arrives and AGI hasn't happened yet? Don't you just have to admit that it's never coming?' No. When we announced AGI 2027, our median timeline for AGI starting to arrive was early 2028. None of these details matter exactly. The Chinese understand manufacturing. They understand that you need to be the one making the stuff, working hard, and achieving abundance and production. They understand not depending on outsiders. These are important things they understand well. And they understand they need to make their own chips because the US is a strategic rival with leverage over Taiwan. They need to not rely on Taiwan for their chips because that can go very wrong in many ways. They understand they want to build their own AI models because you don't necessarily want people asking AI models run by the West for knowledge. What if they ask about freedom, Tiananmen Square, or independence? They have a well-established principle that they don't like that, so they need to make their own models. They also don't know if there are any backdoors in this technology or if we might be conspiring in various ways. We're not, but they have no way of knowing reliably. There's a reason we wouldn't trust them, and we'd probably be right not to. Huawei was absolutely putting backdoors into various technologies they were shipping, so it wouldn't be a surprise. That doesn't mean they're wise. It doesn't mean they're not going to overbuild some things and underbuild others. They're going to make massive mistakes. There's a reason their fertility level is down around 1.1 and their youth are not doing great. We always do this; we assume our rivals will succeed, like Khrushchev's line, 'We will bury you,' or that fascism was the future because Mussolini could direct his people to do valuable things. And now, the narrative is that the Chinese don't even make profits; they're just competing ruthlessly, driving everything down, out-competing everybody, and taking over. This doesn't mean there aren't challenges, but don't turn them into an infallible monolith and don't assume they know what they're doing. In this case, the Chinese don't believe in AGI. Chinese companies and AI labs might, but the Chinese Communist Party basically doesn't and doesn't understand the game it's playing. The bad news is that if it's racing to maximize chip production, capacity, and compute, it's going to do basically the right thing for the AGI race anyway. It's already maximizing energy well beyond anything we're doing and has an essentially infinite energy supply, so that's not good either. But they're making mistakes, like potentially telling Deepseek how to use robot chips for training and inference, throwing a giant monkey wrench in their progress because they put that much priority on the integration of their chips in a way that doesn't actually matter. They're also misinterpreting the signal. If we're so intent on being the ones selling and producing all the chips, that must be what we're racing towards. If the White House is saying that's what matters, why shouldn't they believe us? That's the most reliable, strongest, costliest signal. So when Trump says, 'You can sell the H20s, we just want a little bit of money, but we want to dominate markets,' and Sacks says, 'What matters is we dominate markets,' and Letnick says, 'We're selling you our third-best chips, you have the fourth-best chips, so you're going to take them and like it,' the Chinese interpret this as, 'Oh, we should have picked these.' They may or may not be actually insulted. I've seen people say, 'The Chinese aren't stupid; they wouldn't do things out of being insulted,' but we do them all the time. Europeans do it, Russians do it. Everyone does it. Why wouldn't the Chinese also do it sometimes? I think that has some effect, but mostly, we got our priorities backward, and they took their cue partly from us. Now they're focused on the wrong ball—not a stupid ball, but not the most important one. And they made a mistake. Really important mistakes happen and drive history all the time. I don't think this is surprising. So, the Chinese are refusing the H20s. It could also be a trade negotiation tactic, thinking, 'If we accept the H20s, the Americans will treat this like a concession to us, but we're not sure about this; we think it's a trap.' They might even think we've put spying devices on the H20s. I don't know. Maybe they did, with zero percent probability. I don't think so, but we can't prove it. So, they have many reasons to be suspicious and think the move is to refuse it. It's obvious to you and me that's stupid, the same way it's obvious to us that selling them to China was also stupid. The key question is, what happens if they try to sell the B30A? Are the Chinese actually going to follow through on their 'No, we don't want American chips; it's more important for us to clear the path for Chinese chips' stance? This is despite China still having demand for all chips far in excess of its supply for the foreseeable future, with plenty of time to switch over to buying Chinese chips if we ever change that. Meanwhile, there's scaremongering about how China is going to triple its chip production. You'll see the graph in tomorrow's Post: here's projected Chinese chip production if they do triple it. Here's what Nvidia will produce next year, in addition to everyone else in the Anglosphere. And here's the relative quality of those two sets of chips. Why are people saying things like in 2026, Huawei is going to pass Nvidia? It doesn't make any physical sense. It's complete gibberish. It also doesn't make any sense to say that us selling the H20s and the B30As is going to slow down Chinese chip production even a little bit. It will have zero effect because Chinese chip production, by their own admission, is going to go full blast either way. The Chinese are making this mistake for a combination of reasons, but we're making the reverse mistake. The other galaxy-brain reason for them to refuse is to goad us into selling them something better. They refuse the H20, which becomes a talking point for the Sacks crowd to say, 'The Chinese are smart enough to not want our chips, so of course we should sell them our own chips.' And then they release something much better, and the Chinese are quietly like, 'Oh no, not the briar patch.'

Zvi Mowshowitz: How many deep chests does that become? It's still surprising. I think that is all pretty good analysis. It's still surprising when not too many months ago, they had this big meeting of Xi facing all the titans of industry, and the DeepSea guy was on the end. He had made it to the big stage. You would think that guy at this point would be able to say, "I have basically infinite demand for chips, and I'd really like to be able to buy these. And don't worry, as soon as domestic production is there, we'll buy those too. And by all means, subsidize that."

Zvi Mowshowitz: It's just strange that you can't get that basic of a message through to the top.

Nathan Labenz: Obviously, he can say, "I will buy as many Chinese chips as you will sell me, and I will also elect to buy as many chips as I can get from Nvidia." I don't see why one has to do with the other. But one thing about authoritarian structures is that they are not good at listening to people or incorporating information. China also has a long history of deciding on a big strategic priority and then enforcing it, whether or not that makes local sense, even when it looks like it's going to cause a lot of local pain, and even when it does cause a lot of local pain. That's not always wrong either. Sometimes you do something that looks really expensive and seemingly crazy because it has long-term benefits to changing the culture, changing the incentives, or encouraging the rise of new industries. Maybe it's wise. And sometimes it's clearly not wise. We have many examples of the Chinese Communist Party and other similar regimes doing things that are profoundly unwise. But sometimes it works out. We're also shooting ourselves in the foot in America in a wide variety of ways. If we weren't shooting ourselves in the foot in a variety of ways, I would have complete confidence that China was just not right. If we were doing proper permitting reform, actively encouraging solar, wind, and batteries alongside nuclear and everything else. If we were doing high-skill immigration, getting all the best people out of China and everywhere else, and bringing them to America. If we were building housing where people wanted to live, having federal rules that just got in there and beat everyone over the head with a crowbar until they agreed to let people build. I have a long list of reasons. But instead, we do things like ban America from having ships that take things from one port and sail to another. We just self-own all the time, and then we act like it's impossible for other countries to also be self-owning, but it's not.

Zvi Mowshowitz: It seems this refusal of chips puts any hope of a Chinese live player in jeopardy in the short term. You can complicate that analysis if you want to. I'm interested in a meta rant, if you have one. It seems like they're splashing the pot, and it's chaos. And then, of course, there was the incident we debated last time: whether to consider XAI a live player. Since then, we've had the Mecha Hitler incident, followed closely by the Grok-4 launch, in which they had nothing to say about the Mecha Hitler behavior. What, if any, rant would you like to offer on the fate of these aspiring live players?

Nathan Labenz: You can't count China out. They have various ways of accessing compute, and they will have various ways of accessing compute. They are experts at squeezing every little bit out of whatever they can get. The Chinese chips don't do nothing, and China still has roughly 15% of the world's compute. They couldn't get something done. They're still smuggling a decent number of chips in. We're putting data centers all around the world, including in places like India, the UAE, and Saudi Arabia. They're not exactly the most secure places to put data centers. So, it does seem like DeepSeek is still clearly the number one Chinese lab to me. Kimmy was impressive in some ways, but I think the standard pattern is that something impressive-looking comes out of a Chinese lab, and then the majority of the time, it turns out to be nothing. It turns out it was benchmark games; its best features were touted, but in practice, it's not very useful. So, if you assume that nothing ever happens, you do very well. But occasionally something happens. With Kimmy, something happened, but since then, it seems okay in some narrow domains but not that good overall. Similarly, there's some other, like z.ai or whatever it's called, that seems okay. But I would say DeepSeek almost certainly still has a lot of talent and is still a live player if the hounds were unleashed, and they still haven't been unleashed. But they look less live continuously as they don't do something in the process. So 3-1 does not count as impressive to me. It counts as incremental, keeps the lights on a little bit, but not very much. Basically, they're coasting off of R1, the reputational benefit from R1, and the fact that open-source models haven't really meant that much since R1. That was the low-hanging fruit that got picked, very expertly, don't get me wrong. But primarily, we've got OpenAI at number one, to me, Anthropic at two, and Google at three. I understand some people think Google is better than that, and maybe they are. They do a lot of different impressive things on the side, but I still want to see more before I'm willing to give them that much credit. Also, their resource advantages are shrinking rapidly. Google started out with, "We've got a trillion-dollar company, and you don't. We've got all these TPUs, and you don't. We've got this huge advantage with its distributional apparatus." All these advantages. "We should just win, right? We're Google." But OpenAI is worth 500 billion. That's a decent percentage of Google, and a lot of Google is not directed at this. Anthropic is already worth 183 billion. We're not that far from the resources being pretty similar. With that advantage gone, the fact that Google is a broken, dysfunctional company in many ways is going to start to catch up with them. Anthropic and OpenAI are advancing fast, but I think they're clearly one, two, and three in some order, and then you can have an argument over exactly the order. Then XAI is the wild card. They have a lot of compute. They play hard, but not very well. You want to write them off. Prove it, right? It's always prove it. Similarly, Meta is trying to come back. I think that we Meta skeptics were proven correct that they didn't have it. Doesn't mean they can't go get it and go another round. But now, they're considering licensing Gemini or potentially ChatGPT, which is wise. I would do it too. You don't have to stop trying to develop your own AI; you just don't have to dog-food it while it's terrible. That's just not smart. There's too much money at stake. But at this point, it would be surprising to me if the big three got disrupted in the near term.

Zvi Mowshowitz: I have a specific question about XAI related to the Grok-4 launch. Several things stood out, including Elon's comment that he's not sure if AI will be good or bad, but even if it isn't good, he wants to be around to see it. I thought, 'I can't believe he just said that.'

Nathan Labenz: Out loud, yeah.

Zvi Mowshowitz: You could tell it was the kind of thing that would only get out on a live stream. But he's that kind of guy. The other thing that stood out was his plan to give the model access to the same power tools used by engineers at SpaceX and Tesla in the next generation of training. It got me thinking, if we're headed for a world where the quality of training problems becomes a differentiator, they might have the best feed of well-structured, hard technical problems solvable with advanced software tools. He has these frontier companies in multiple domains that seem to do a good job of structuring problems. Anthropic and OpenAI don't seem to have something like that. Google does, but it's diffused across its vast archipelago of fiefdoms. I could see Elon structuring that pipeline of hard problems into an RL cooker at xAI and winning because of his access to the best engineers working on these really hard things. Does that seem at all credible to you?

Nathan Labenz: Not really. I don't think that much data naturally occurs. These companies aren't that big, especially SpaceX. Beyond that, if that is the thing that matters, then there's a lot of data in the world to be collected, and there isn't much of a barrier to getting it. There's nothing stopping Google, OpenAI, or even Anthropic from making alliances to get that data, and there's no reason why those companies wouldn't be happy to help them in exchange for not much money. So, I don't see it as that big of an advantage.

Zvi Mowshowitz: It just seems hard. I agree there are other companies; for example, other car companies. There aren't exactly other space companies, but let's say car companies. You have all these engineering things happening at Tesla, and they're happening at other companies in varying degrees.

Nathan Labenz: There are plenty of other companies that build things, right?

Zvi Mowshowitz: But can you imagine going to a company like General Motors, in my hometown, and saying, 'Hey, can we extract your hardest engineering problems, structure them, get clarity on the answers, and train our AI on that?' Even if the CEO of GM agreed, I feel like it would take at least ten times longer than it would for Tesla to do a similar thing.

Nathan Labenz: Why?

Zvi Mowshowitz: I've done a little work with GM. For the same reason nobody else has anything close to a self-driving car. Only Tesla and Waymo have come far in that domain; everyone else has given up. They just don't seem to have the organizational juice to pull things like that off.

Nathan Labenz: Executing a long-term, complicated software engineering plan is very different from collecting a bunch of data. What data do you have to collect? Google has Waymo, so they already have infinite driving data at their fingertips. It's not that hard to put cameras on cars if you want to collect driving data. It's not that expensive, considering how much they're spending on training runs and acquisitions. I just Googled GM's market cap; it's 55 billion. If you were OpenAI and this was so important, you could just buy General Motors.

Zvi Mowshowitz: I wouldn't recommend it.

Nathan Labenz: Why not? Imagine if you could buy General Motors and launch self-driving cars relatively quickly using OpenAI's techniques. Couldn't you generate a lot more value than 55 billion? What's Tesla worth? Why is it worth so much more? Are you sure we shouldn't buy General Motors? I'm not saying they should, but if the thing preventing OpenAI from winning the AI race is not owning General Motors, they can just own it. It's not that hard, and there are big synergies.

Zvi Mowshowitz: The thing that I see being hard to reproduce, even if you bought General Motors, is something that companies like Tesla and SpaceX potentially have: really clean environments. It's a well-oiled machine where data is flowing and vertical integration is deeper. The mess of the supply chain at GM, all the-

Nathan Labenz: But what does that have to do with data?

Zvi Mowshowitz: ...all the suppliers, all that nonsense.

Nathan Labenz: What does that have to do with the data collection?

Zvi Mowshowitz: Data collection can mean multiple things. Cameras on cars is one version, but I'm also thinking about problems. For example, we wanted to design something that met a certain specification. Eventually, somebody did it. What was that design and where does it sit? That information seems much more accessible and ordered at Elon's companies compared to legacy manufacturing giants that have already declined.

Zvi Mowshowitz: a bit already.

Nathan Labenz: I see various points of this story that don't make sense to me. For one, that this would be that big of a deal, especially given Google has Waymo, which is the only company actually doing the thing as far as I can tell. I don't like this myth of Elon being a super executor. Elon has not been the old Elon for a while now, if you judge by the quality of his public statements and decisions. This includes blowing up his very close relationship with the President of the United States over nothing he stood to gain. He was the right-hand man for tech to Donald Trump, and then he got mad about the deficit, something he had no influence over and shouldn't care about given his belief in AGI. He broke the entire relationship consciously and intentionally, and now that person seems to be Jensen. It's a disaster for the United States. His influence going away did not help anything Elon Musk cares about in any way, shape, or form. His life is just worse. Everyone on all sides basically hates him. If you look at the self-driving car situation, they have been promising these cars 'real soon now' for how many years? The same promise over and over again. I'm not saying they're not making progress, but it's been way behind any schedule he's told us to expect. He's over-promised and under-delivered for about a decade. So I'll believe it when I see it. Also, is it a data thing? I don't think so. I often hear stories about why someone will win because they have a certain advantage. But if that advantage was so important, you could just go get it. So for it to matter, it has to be the important thing, and everyone else has to not realize it until it's too late. You still there?

Zvi Mowshowitz: Yes. I would say FSD, for what it's worth, is getting very good. It's been a couple of months since my last FSD ride, and it had been a year between that one and the one before. The progress was very obvious. You no longer have to keep your hands on the wheel, for example, which shows an increasing level of confidence. I was very impressed. I'm also super impressed by Waymo, but I wouldn't say FSD is-

Nathan Labenz: I'm excited about-

Zvi Mowshowitz: ...far behind.

Nathan Labenz: I'm excited for it, but I still don't have the ability to have a car drive itself. So, there you go.

Zvi Mowshowitz: Waymo is supposedly coming to New York City pretty soon.

Nathan Labenz: I'd be totally up for that with the New York City Council and the mayor's office.

Zvi Mowshowitz: I think I just saw this on Timothy B. Lee's Twitter feed.

Nathan Labenz: A car has been spotted in Brooklyn driving around, mapping the city. That's great. But there are laws on the books that say they can't operate. What is the plan to deal with that? Don't get me wrong, I want this to happen so badly. I want my Waymos. I will forgive the new mayor many things if he brings us Waymos. But-

Zvi Mowshowitz: You could take your government Waymo to your government grocery store.

Nathan Labenz: I don't want a government Waymo. I don't want a government grocery store, but there you go.

Zvi Mowshowitz: Okay, last area for today. We both just participated in different cohorts in the Survival and Flourishing Fund grant-making process as recommenders.

Nathan Labenz: Yes, we did.

Zvi Mowshowitz: I'd love to hear your thoughts on the broad survey of the AI safety charity landscape, from cause areas to specific organizations to anything you think is neglected. What did you take away from it? There were over 400 grants, a bunch were pre-filtered out, but we still had 125, I think. What do you make of it?

Zvi Mowshowitz: ...of it all?

Nathan Labenz: The first thing to make of it all is that you can't actually evaluate 125 applications, let alone 400, in the time they budgeted for us. It's just impossible. You can properly investigate about 10 organizations, and then you have to evaluate everyone you think deserves consideration for funding, which is a lot more than 10. So you're relying heavily on your fellow recommenders, reputation, and past research. One advantage I had was being in a recent round of SFF, as many of the same nonprofits were applying again. I could request a diff on those organizations rather than starting from scratch, which is a huge time saver. Many of the charities that ended up near the top for me were basically the same ones that were under serious consideration last time, because the situation hasn't changed much. I had the same number one as the previous round, which was the AI Features Project. That's Daniel Cook and Taj Lowe, the people who did AI 2027 between the last grant and now. I feel this is a place where I can take a victory lap. There are no downsides to that. It still feels like a good hit, given I was the only one who put them that high last time. This time, there's a big consensus among several recommenders that they should be high. You can do a big divide between policy versus research. Are you trying to solve alignment in some form, directly make the world better, or shape public opinion and policy, propose laws, and file lawsuits to set better policy? I wanted to do both; it wasn't obvious that one was strictly better than the other. You're largely looking for what is underfunded and what won't be otherwise supported by the ecosystem. One question I asked a lot was, what is SFS's comparative advantage? Where can we identify talent and opportunity in a way that would be difficult for others? Last time, that was the AI Futures Project, which was unable to get certain other funding at the time. This time, I had the CAIS Action Fund pretty high, simply because it's harder to raise funds for a C4 than for a C3. Not because I felt the Action Fund money was better spent on average, but that the distribution would naturally weigh too far the other way. ACS Research was one of my top picks specifically because they were running out of funds. I'd seen them do valuable things and thought, "I need someone doing valuable work not to just fall over and die." It seems valuable that you stay in the ecosystem and keep your organization without constantly searching for another home. You can do what you think is valuable. Ultimately, that's how I had Neri this year. Neri didn't need funding for many years because they got some very large donations and did the virtuous thing of not asking for donations while they didn't need the money. Now they need the money. I hope we come together and ensure they continue their quest. Those were some things that stood out at the top, but I have a very long tail of places I would be happy to fund. If you were to allocate all the money for the entire round and ask what I would fund, there are 10 organizations that would get at least $400,000, another 10 that would get at least $100,000, and a long tail that would get some amount of money. I could go on about more orgs. I'm planning to do another nonprofits post later in the year, in advance of Giving Tuesday, to give my updated views.

Zvi Mowshowitz: Nice. My written output is obviously not 2% of yours, but I'm planning at least a Twitter thread on that topic as well, so we can compare notes as we get into the long tail.

Nathan Labenz: It's a lot of work, and I'm going to have to set aside specific time for it. I don't know when they're going to announce the results, and I want to finalize my post after that. I want to see who gets how much money and who is on record as having received money, making them a public part of the round. Anyone who doesn't receive money is not publicly part of the round unless they say so. Therefore, I have to email each of those people and ask, "Do you want to be in this post?" This gives them the opportunity to email me and say no. I treat no answer as a yes because, in general, charities want people to say you should give money to them. But occasionally—

Zvi Mowshowitz: It's safe.

Nathan Labenz: someone doesn't want that.

Zvi Mowshowitz: That's usually a safe assumption. Another category I found interesting was international relations. I know you're not very bullish on US-China cooperation. I'm not arguing you should be, but there was a crop of organizations trying to work on that, and I was pretty into it.

Nathan Labenz: One of the issues with the round is that there were some organizations I knew were doing good work. They had clear wins, or I knew the people involved and what they had done, so I could be confident in them. It was very hard to give similar ratings to organizations where I didn't have that background. It's a question of— I won't name them because I'm not sure they're being funded yet and it hasn't been made official. But there were some charities trying to create Track II talks or otherwise advance US-China relations on AI, and I gave them some support. I definitely rated them as fundable, but I was hesitant. Diplomacy is notoriously difficult to read. If one of these charities was effectively fake, in the sense that what they were doing had no real effect, would I know? Would they know? If they're bad at it,

Zvi Mowshowitz: Yeah, that's a tough one.

Nathan Labenz: they might not even realize it. They might be doing something, thinking they're accomplishing a goal, but not actually accomplishing anything. Diplomacy can look like you're doing nothing for 10 years, and then suddenly something happens. Or, it can look like you're working on it, nothing happens, and that was actually the right thing to do. You just create the possibility of something happening, and things go a different way, or you prevent a bad thing from happening without even realizing it. So it's high leverage, but you just don't know. Because of that, I find it difficult to get fully behind any of these organizations. Some of them will definitely make the post as something I think would be reasonable to support. But it's really tough when you have money that could definitely be well spent to put it into an impossible-to-read area. When you don't get good feedback, you can't tell, which is all the more reason to ask, "How do they know how to do things properly? How do they make good choices even if they're properly motivated?" They can't tell either.

Zvi Mowshowitz: Another category, this is a meta-category, is the California Bill SB 1047. I'm sure you've engaged with it a little bit. It would create a private regulatory market where the Attorney General or...

Nathan Labenz: Okay.

Zvi Mowshowitz: perhaps some new commission would credential private organizations to be regulators, and then there would be some

Nathan Labenz: Right, right.

Zvi Mowshowitz: sort of trade where if an AI developer opts into regulation from one of these private developers, they would get some liability protection in exchange. That obviously begs the question: who steps up to be these private sector regulators

Nathan Labenz: Oh, yeah.

Zvi Mowshowitz: in the event

Nathan Labenz: I guess.

Zvi Mowshowitz: that this bill were to become law? I was looking out for who I thought could become that if the opportunity were to arise. I don't know if you've thought about organizations in those terms previously.

Nathan Labenz: Yes, a number of nonprofits could plausibly step up and become auditors in this space. There are also many founders capable of creating new organizations to do that. If there's demand, there will be supply because it's not hard to find expertise in this space that would be happy to participate in these organizations. I just don't want to name specific names or fall into the trap of, "Oh, this must be regulatory capture for these five people or five orgs." I don't think that's the case. I don't think there's going to be any problem with that. The most likely scenario is that companies like OpenAI will quietly ask people they expect to do good jobs to spin off organizations they can work with, if they're not happy with the initial options from the natural occupiers. But certainly Meter or Apollo, for example, are already being contracted for evals by the big companies and would presumably be the first ones out and very credible in that capacity.

Zvi Mowshowitz: The last category I'll present is hardware governance. There weren't too many organizations specifically working on this, but my group observed that everyone seemed to have a different reason to favor hardware governance. What are your thoughts on it?

Nathan Labenz: Unfortunately, at the moment, hardware governance is politically very difficult. I wouldn't say it's dead, because things change quickly, but it's not looking promising. The current focus is entirely on getting people to use our hardware, and the last thing users want is to be tracked. This directly conflicts with current priorities, so it won't happen right away. I still believe it's vital that we have this capability. It's crucial that we have the technology completely shovel-ready, so if we decide that three months from now, every new chip shipped must be tracked, we can do that. Ideally, if we need to go into a data center and place trackers on chips in a way that reveals tampering, we should know how to do that as well. Potentially, a very small investment could give us that option, and then we'd only spend significant money if we implement it. It gives us optionality and allows us to develop these solutions. I think the best immediate solution to the current situation is to use this capability to enable things like building more secure data centers in the UAE and India, and to stop chip smuggling. It could also allow us to be more aggressive about what we permit individuals we're concerned about to do. Again, I don't think we need much effort; just a little effort to lay that foundation, and then it's about figuring out who the key players are.

Zvi Mowshowitz: What about just the simple question of the balance between available money and opportunities? For me, the feeling was that I wished I had more money to give out. Obviously, that's often the case. But if you were to pitch to other philanthropists that there's a lot of high-impact work that isn't funded as much as it should be, first, do you believe that? And second, what would that pitch sound like?

Nathan Labenz: That pitch would sound something like this: If I wanted to distribute the entire projected round of, say, $10 million, it wouldn't allow me to give out all the money I would have been happy to distribute, not by a long shot. I could easily hand out more than double that and feel good about every dollar. Of course, I would want to investigate some things more, because there are areas where I'd never give money, having already calculated that. However, you could give tens of millions of dollars to these applications with basically zero worry. That is clear evidence that there is infinite philanthropic space. Also, there's a lot of work that is at a scale above small-dollar funding, which we basically can no longer fund as charities because their capacity to effectively use money has grown to millions of dollars a year, and we just don't have that capacity, or in some cases, even approaching tens of millions or more annually. There are also many philanthropic projects that were never even proposed because their price tag would be absurd. And everyone in this space is, of course, conserving money, which isn't ideal. We prefer to have generous salaries and compute budgets and not worry about these things, but that would require a lot more money. But even without changing that, even keeping everyone focused and not pursuing massive new experiments or anything like that, just doing what we're already doing, I feel confident that a lot more money could be invested in this space than currently is. It's just not available right now. So, it would be great if you could help out. There are many places to put it. And again, many people simply aren't asking for money because they know there are so many others asking. I'm in that position; I'm supported by patrons who are happy to fund my work, but I won't actively seek more funding because I know there's far more demand than supply. It doesn't feel reasonable for me to ask for much more. But would I scale up, at least somewhat, with more funding? Obviously, yes. There you go. And I also have a lot of research that, again, is trying to be lean, so we're all trying to be lean here.

Zvi Mowshowitz: Here's an idea I want to get your take on. This wasn't in the application pool, but I saw an article the other day, which I'm sure you also saw, about how OpenAI is starting to subpoena some charities. Some of these were in the application pool and are doing things that OpenAI finds inconvenient, like hassling them about their nonprofit to for-profit conversion. They seem to believe, and the reason they're giving for these subpoenas, is that these charities may be funded by competitors. In other words, they think perhaps Elon or Google is funding these organizations to try to slow OpenAI down. This got me thinking, maybe that could happen at the model evals level. We have these model evaluation companies, organizations, and nonprofits that are very focused on being even-handed, fair, analytical, and trying to do things pre-release, which I think has a lot of value. But that forces them to play very nice with the companies they're working with. I wonder if someone came forward and said, transparently or not, "We're going to go after all the companies except Google and demonstrate to the public why their models are problematic, shouldn't be trusted, and all the ways they go wrong. We're not targeting Google, perhaps because we're funded by someone else." Could you engineer a situation where all the companies then feel they better target their competitors' models and invest in demonstrating what is wrong with their competitors' products? If you could create that equilibrium where they're all sniping at each other, would that bring things to light and potentially create the race to the top that everyone wants? It seems the only reason that isn't happening is maybe a sort of soft collusion or unspoken gentleman's agreement. But if somebody were to break that, maybe it all goes to a different equilibrium where everybody is investing in that, and we have a lot more energy going in that direction. What do you think?

Nathan Labenz: It's not a great look from their perspective to be funding attacks on these other companies. When you expose these things in other companies, you are also exposing yourself. Almost always when you find these flaws, they're everywhere in some form, and it will seep into public consciousness and could lead to calls for greater regulation or other escalations. You don't need collusion to think, "My gang has guns, your gang has guns. Why don't we just stay away from each other's territory and not shoot at each other, because that could get ugly pretty fast." Most of the time, Coke and Pepsi don't start smear campaigns on each other; they just do positive advertising. Maybe they take a few cool little side shots, but they don't fund nutrition studies about why the other one is unhealthy, because that doesn't really work. But certainly, you could fund people to go after specific organizations or do deep dives on their models. You can have opinions on who to target first, and I don't think that's crazy. Some companies are being less responsible than others and deserve to get hit more. If you target Anthropic, they might just say thank you. So there's that.

Zvi Mowshowitz: Yes, that's why when I said, "Target everybody but Google," I was thinking the rationale is that there are a lot of Google billionaires who could plausibly be funding such a thing. The reason they might do it could be a mix of competitive advantage and a philanthropic desire to bring issues to light. And it might-

Nathan Labenz: I think the problem is that specifically not targeting Google raises questions. If you're a Google-funded organization that only targets OpenAI, for example, then why would I think your evaluations are objective? Why would I believe that when you say something is a problem, it's a real or particularly significant problem?

Zvi Mowshowitz: I think the idea would be that-

Zvi Mowshowitz: ...it's just reproducible. If you have inputs and outputs from models and just demonstrate this is what's happening-

Zvi Mowshowitz: ...that's pretty...

Nathan Labenz: I didn't mean, is the finding real? I meant, what we're trusting is that you followed scientific procedures in selecting this example. That it's representative, that it teaches us what you think it teaches us, et cetera, et cetera.

Zvi Mowshowitz: Yeah, I see that challenge. Although I also think that's all pretty slippery. I mean, as much as there is a very sincere desire among the evaluation groups today to have these high standards of rigor, all their stuff is always questioned. And you've got a lot of people who are like, Oh, well, this is totally nothing because you put the model in the situation.

Nathan Labenz: Look, it is the job of Caesar's wife to be above reproach and be reproached anyway. It is the job of those who are trying to be the watchdogs in these situations, like the people who are trying to be in our position. We have to follow standards of rigor and integrity that are vastly above what others are held to. And that's table stakes. That's just the right to play. And it's not fair, and that's just life. You're still going to have all that questioned and attacked. Yeah, absolutely. We've all been, we all saw the debate over 1047. We all saw how you bent over backwards to be 10 times better on all of these issues and all of these questions than the people you were opposed to, and it didn't... You had to be able to play in the arena. You have to, if you're the underdog, if you're the scrappy, underfunded person who's trying to bring the truth to light, that's your job. The big corporation is going to try and squash you with anything you've got. You've got to be sparkling clean, you've got to have no vulnerabilities and no points of leverage. No smear. It's just how it is. And it sucks, but you know, got used to it. All right.

Zvi Mowshowitz: So does that mean you basically don't... The equilibrium that I'm trying to envision a way to shift from and to is one where today... And there was this mutual adversarial collaboration. I don't know if it was really adversarial, but there was this OpenAI and Anthropic evaluated each other's models seemingly in a pretty friendly, collegial way.

Nathan Labenz: That was great.

Zvi Mowshowitz: So yeah, I think everybody loved to see that. But that's not happened much. Maybe it'll happen more again in the future, maybe it'll never happen again. I tried to engineer a transition to a different equilibrium where everybody is adversarially evaluating everybody else and thinking, Maybe if I tip one domino, everybody else will feel that they have to respond. And then from a general sort of safety-ist worldview, it would seem better if they were all adversarially evaluating one another versus not. So I guess you could question that first or question that assumption of, That would be a better equilibrium. And then the other question is,

Nathan Labenz: I would love

Zvi Mowshowitz: can we tactically get there?

Nathan Labenz: I would love to get to a point where the companies were doing each other's evaluations in an adversarial fashion, looking for trouble, looking for vulnerabilities, looking to embarrass them, and they just had to see if you can deal with it. See if you can beat it. See if you can overcome that. That sounds great. I would love that. I don't know how you get there from here. I think that these companies do not want to go to war with each other. I don't particularly want them to go to war with each other in other ways. Also, people are constantly like... I mean, you've already got them going to war over talent, so I don't know. But I can see it being a thing Anthropic might want to do in the medium term is to say, We run all of our evaluations against everybody's AIs and we report them back. And occasionally we're going to find some stuff. But yeah, I don't know.

Zvi Mowshowitz: Yeah, they did do that with DeepSeek. They did come out and say, DeepSeek has no qualms about doing bio weapon-type stuff.

Nathan Labenz: Yeah. Their evaluation of the DeepSeek safety protocols was... Safety protocols, right? Yeah. But yeah.

Zvi Mowshowitz: Okay. Well, if any Google alums with the resources want to talk about seeding such a thing, my DMs are open. Always the final question for you in particular, what that we haven't talked about is virtuous to do now?

Nathan Labenz: Yeah. I think it's a weird situation where it can be really tough to figure out where to make the most meaningful progress, what to do going forward. On policy, the short-term priority has to presumably be, prevent America from being so foolish as to sell B-30 base to China, which in practice presumably means getting enough people on the right sufficiently alerted to this is actually happening and here's what this actually means. They raise enough stink that it doesn't actually happen. And otherwise, draw attention to the extent to which NVIDIA seems to have taken hold over the White House in terms of its rhetoric and its plans overall. Not to... I mean, this is obviously the ultimate end goal or the primary reason for that. But yeah, you see it in everything. Obviously, as usual, trying to spread the better narratives is always good. I've certainly gotten to the point where I think that working for Anthropic seems to clearly be a net good idea at this point, if you are considering what to do and the alternative is doing basically nothing. I do think there are a number of works which are presumably better choices for impact than just working at Anthropic, but that doesn't mean that they have capacity or that you want to work there obviously. But yeah, it's a difficult situation because obviously the policy situation is in a bad state. Alignment is in a not great state. There are obviously infinite things to work on, infinite things to experiment with, infinite organizations to give money to, and so on if you want to do that. But I mean, my product is basically, keep trying to keep myself and others informed and understanding of the situation and hope that that will lead to good things more than anything else. But yeah, I wish I had a better answer obviously for a call to action like, Now that you've listened to this podcast for three hours, or depending on what speed you're listening at, what are you going to go forth and do? But yeah, unfortunately, I don't really, other than think hard about the world and try to figure out what under your model would be the right things to do to advance it and what are the things that would actively make things worse? Because for a lot of people who are informed, their first step is just be aware of what would make things worse and don't do that. But I mean, besides, you should say what you think more than anything else. You shouldn't sugarcoat, you shouldn't engage in hyperbole, you shouldn't strategically censor yourself. With notable exceptions, you should just say what you actually believe about the situation. And one thing you can do is you can support the book release of Eliezer Yudkowsky and Nate Soares. They are coming out with a book in about a week, If Anyone Builds It, Everyone Dies: Where the End is Superintelligence. And if you were to help by purchasing and spreading the word about that book, this is a unique opportunity to get that, have momentum, maybe create a cultural moment. But again, that doesn't mean that you should just back their talking points and their idea about how this works just because they're the ones with the book or just because Eliezer is the rightful caliph and he said so or anything like that. You should make up your own mind.

unknown: I've pre-ordered my copy and look forward to reading it. As always, I really appreciate your time; you've been very generous with it. We'll continue to stay informed via the blog, Don't Worry About the Vase. That's it for today. Zvi Maschwitz, thank you again for being part of the Cognitive Revolution.

Nathan Labenz: Thank you for having me. Alright, bye.