Introduction

Hello, and welcome back to The Cognitive Revolution!

Today, my guest is Manu Sharma, founder and CEO of Labelbox, a data factory that supplies frontier training data to all of the top Western AI labs and many enterprises that are pushing the performance frontier with task-specific, fine-tuned models.

This conversation couldn't be more timely. 

In the wake of Meta's recent $15B deal with Scale AI, which saw Scale's former CEO Alex Wang join Meta to lead their "superintelligence team," other frontier model developers are scrambling to secure their training data pipelines, and the market overall is still in the process of realignment. 

These headline-making developments have really spotlighted just how critical super-high-quality data is to today's frontier capabilities, and how difficult and expensive it can be to create.

Post-training budgets are growing rapidly as labs race to imbue their models with differentiated capabilities, and as Manu explains, every Western frontier lab is now spending over a billion dollars annually on frontier training data.  

Such tremendous investment is needed because modern data work has moved far beyond the simple labeling and preference indication tasks that many are familiar with from years past.

And so, with Manu as our guide, today we'll be tracing the evolution of post-training from supervised learning on human examples, to Reinforcement Learning from Human Feedback, to the Reinforcement Learning from Verifiable Rewards paradigm that's ascendant today, and unpacking how that's shifted data creation work away from tools that facilitated the collection of human reasoning, toward environments, which Manu calls "gyms", where models can develop new skills – starting with coding, mathematical reasoning, and computer use – through a process of trial, error, and highly automated feedback.

The bottom line is that today, when you think about training data creation, you should envision, not the data sweatshop of the past, but highly qualified and comfortable domain experts working to create sophisticated Reinforcement Learning environments and verifiers that teach frontier models to solve complex, long-horizon tasks.

A quick disclaimer: Labelbox will be sponsoring the show for at least the next month, and so this does qualify as a "sponsored episode."  Nevertheless, I can sincerely say that this is a conversation I'd have wanted to have anyway - because whether you're tracking AI progress analytically or trying to achieve superhuman performance in your particular AI application, understanding the best practices for training data creation is essential.   

And besides, the scale and scope of Labelbox's operation is genuinely remarkable.  Their most recent capital raise was $110M in early 2022, and today Labelbox operates as a vertically integrated data factory.  They're scaling their expert network in part by conducting over 2,000 AI-powered interviews per day, and their most in-demand experts are already earning more than $250,000 per year on the platform. 

I haven't yet had the chance to sign up to earn some of that money myself, but I am looking forward to the AI interview experience and – assuming Manu's right that the system does recognize me as a qualified expert – I'll report back, subject to any required NDAs, on my experience as an AI trainer.

Now, without further ado, I hope you enjoy this deep dive into the ongoing evolution of the red-hot training data market, with Manu Sharma, founder and CEO of Labelbox.



Main Episode

Nathan Labenz: Manu Sharma, founder and CEO of Labelbox, welcome to The Cognitive Revolution.

Manu Sharma: Thank you.

Nathan Labenz: I'm excited for this conversation. Your world is in chaos; the AI world is undergoing a reorganization or realignment right now. When I say your world, I mean the world of data: data creation and human sources of data. You've been in this business with Labelbox for a number of years, and we'll have a chance to dig into all the different facets of that. The news that has sent shockwaves through the industry in the last couple of weeks is obviously Zuck and Meta's unusual deal with Scale, where they're bringing Alex over to lead the superintelligence team, or something similar. Then, of course, we have people leaving Scale, and it seems generally chaotic. So, what's your report from the front line? Is this driving a lot of opportunity for you? Are people confused? What interesting perspectives are you hearing? I'm very interested in your gonzo report from the front to get started.

Manu Sharma: This is one of the most exciting times in the AI industry and generally across the board, right? When you really look into the innovations and pace of progress, I believe we are experiencing the maximum innovation we have ever seen in a per-day or per-week time period. It's an AGI race, a race among a number of companies and groups. It's interesting to see that there are big AI labs playing and pushing out awesome capabilities for base models, as well as products and experiences. But then you have a number of teams emerging with new ideas, taking a bet on alternative techniques and so forth. One thing clear underneath all of that is that these teams need three components to develop frontier AI systems. One, of course, is compute. We've been seeing incredible investments in CapEx across the board. The second is AI talent; these are the researchers who know how to craft these neural networks, their architectures, how to train them, and what to train on. A third, equally important piece, is data. We are very much in a regime where a lot of the emphasis is going towards post-training. In the AI world, the way these AI models are trained is you initially have a first phase where you take all the data from the web and sources you can access, essentially training a base model that can understand the patterns of all human-generated data over, say, 100 years. After the base model is trained, there is post-training. Post-training emphasizes aspects of the AI models we interact with; it truly turns these AI models into assistants. Within post-training, a number of techniques have emerged or become very important over the last three or four years. Regarding your question, it's awesome to see so much limelight thrown at this industry, specifically on how companies produce this post-training data. I think the news about Meta essentially emphasizes that data is very important for building these AI models and capabilities. 
Our industry, generally speaking, has been going through a number of exciting step changes. When we started Labelbox in 2018, supervised learning was the dominant paradigm. Everybody was creating a lot of training data for supervised training. You had humans around the world who would tag images, videos, and text snippets, essentially training these models to mimic that behavior. Around the time transformers came out, we started to see the inflection towards unsupervised learning, where most of the learning happened on large-scale datasets. However, as we emerged through that new paradigm, RLHF became a prominent way to make these models useful in everyday life. This also required new forms of data, very specialized datasets with expertise, because these models now had strong base capabilities and needed to learn the ways and skills to interact with knowledge work and knowledge workers. So RLHF and SFT became the main approaches, and now we are in yet another new paradigm of reinforcement learning. Nearly every AI lab I'm aware of is emphasizing creating datasets for reinforcement learning. Back in the day, we saw amazing innovations from DeepMind with AlphaGo and so forth, which were trained with reinforcement learning, and it has now reappeared in a new flavor, a new form. So that's the high-level trend we have seen. All these things require specialized datasets, and they need a lot of them. I would say we are still in the very early days of actually making these AI systems more and more capable for everyday tasks that power the economy across knowledge work. That's what's going on at the macro level. These AI players who are building large, flagship AI systems or models are very aware of how important it is to have access to ways of producing a whole bunch of datasets in all these interesting specialized domains that ultimately get into useful business domains.
This is where AI can learn about, say, what software engineers do in their everyday tasks when building companies and software. Or, in domains like healthcare or insurance, how do you learn from the core workflows that produce the datasets needed to create these models? These are generally very interesting and exciting times, and it is common to expect M&As and certain teams making decisions as they see fit for opportunities in an industry like this. Labelbox has been behind the scenes providing these datasets to most of these AI labs. Workloads have shifted, sometimes a bit quickly, and it can take time for teams to reorient to new supply chains. That's the macro context of what's going on, and I certainly expect even more changes in our industry over the next few years. There are new ways and new teams innovating on how to create datasets and how to help these AI labs produce them. So that's what's happening. These are very interesting and exciting times for people who are keen on how these datasets are created and how these models are learning.

Nathan Labenz: Let's dig in a little more on the nature of post-training, the different techniques, and maybe some of the subtle differences. A year ago, the prevailing wisdom was that maybe 98 or 99% of compute was going to pre-training, and post-training, while the data was obviously critical to make it work well, was a small thing you could rapidly iterate on. It seems like that has changed, but it's unclear how much it's changed and whether everybody is changing in the same way. I'm always confused about whether we're experiencing convergence or divergence in the leading model providers' offerings. There's definitely striking convergence, but there's also maybe some divergence, and maybe more divergence to be expected in the future. And I think about this anecdote from the Claude 4 training, where they had mistakenly left out... I'm sure you've seen this, that one harmful-system-prompt dataset, and then discovered it behaviorally, as I understand it: the model was following harmful system prompts. They asked, "Why was that?" and ultimately traced it back to a typo in their config files; this one dataset that was designed to teach the model not to follow harmful system prompts was omitted, so that behavior was not learned as intended. And then they didn't go back and redo all of post-training with the right configuration. Instead, they patched it or made an on-the-fly adjustment later. So from that, I infer that post-training has now become a significant enough part of the overall compute that a company like Anthropic, as committed as they, I think, mostly credibly claim to be to getting this stuff right, didn't feel like they had the luxury of going back.
So what can you tell us, obviously without sharing client secrets, but in a way abstracted from any one strategy: how much compute would you say is now going into post-training? Is it as simple as... I'm sure it's not. It used to be pre-train, then some SFT, then a little RLHF, then you're done, or at least that was how the public, the AI public, understood it. I assume now there's more interleaving, more steps, and just more complication. So yeah, tell me more about what you think post-training has evolved into.

Manu Sharma: Yeah. I'm confident that in all of the big leading labs, including many of the hyperscalers, teams are actively researching and innovating across the wide front of end-to-end training recipes. I'm sure there are teams pursuing new ideas on how to make pre-training more efficient, the right kinds of data mixtures, and so forth, but you have to know that pre-training can take a long time. If it takes a few months to actually train a very large model, then in a release cycle, when these companies are shipping these products, that is a given: it's going to take X amount of time, six months, let's say, to really get something out the door. And of course, we are seeing that some of these teams are training gigantic models as base models, and we'll talk about it in a bit. These models are in many ways used as teacher models for the student models that are actually going to ship. Now, I would say that the fast-growing budgets are going into post-training, and the intuition I'll give here is that in many ways these base models have very strong capabilities for understanding the patterns of data and knowledge across a wide range of modalities. However, what we are still finding is that these models now have to produce useful work in the form of, you might call it, agents and long-horizon tasks. And to solve that problem, these research teams have to use techniques to teach that capability. So there is sufficient knowledge and capability in the base model, but you then have to go teach these models how to do X, Y, Z tasks. Let's take a very simple example.
In software engineering, these models are obviously awesome in terms of what they can do in coding, but there's still so much to be desired when it comes to professional software development at large scale. How can these models develop an entire suite of products with minimal supervision? We are starting to see that, but it's not there yet. To solve that problem, these research teams have to collect the right kinds of datasets and then do reinforcement learning on them. That is where a lot of focus is going across the board: teams testing new recipes and new techniques and seeing whether the models are getting better at those capabilities. They generally have a team, or a number of teams, exploring that frontier: how do we get these models really good in all of these categories? The two obvious macro ones are coding and math, because of the fundamental way these RL systems work: in coding and math, you can verify to a large extent whether the model got the solution right or not. Let's take a few examples. In coding, say I come up with a pull request: does the PR actually solve the stated objective, and do my tests pass in the code base, both the prior tests and the new tests? In RL, the system will essentially try millions of attempts to solve that problem, and if it gets it right, it scores really well, the model's weights change, and it keeps going to the next one. Similarly in math, there's a large number of problems that can be verified by a numerical answer, and the very fast progress we saw in reasoning models and coding capabilities over the last nine months or so is because of that.
These teams are putting a lot of emphasis on just getting that right. For one, it is a very valuable thing to do, because coding could be one of the most lucrative forms of knowledge work in many ways, and some teams believe that if they solve it, they can actually accelerate their AI research; it certainly seems like Anthropic believes that, and many others do. Then there are new innovations, new ways to train these systems in the non-verifiable regime, where in a lot of our day-to-day work you don't really know what the right answer may be, and there may be multiple right answers. How do you model the data so these RL systems can track their progress: are they getting it right? And if they create a candidate solution, how can they score that candidate solution effectively? We have some really promising techniques right now underpinning a lot of these latest capabilities you'll see across the models. Now, I do think that at a very high level, these models seem to be converging, in the sense that they interact and look and feel the same from the outside. However, there is this philosophical principle that to be able to really critique a capability, you have to be at an expert level to understand the nuances of quality and have a quality judgment like: this model is really good at this thing, but not that thing. So there is convergence maybe at a very high level, when you look at very generic capabilities, but there is actually a lot of divergence when you go into the details. I code quite a bit. It's awesome for me to be able to get back into coding now with AI assistance, and there are many times when I am working on a very tough coding problem, I'm stuck, I'm iterating with these models, and I get an intuition that I've hit the ceiling of this model: I'm in a trajectory where I'm not really going to get this model to solve this problem. I'm able to see that across a variety of models, where I have an intuitive judgment like: Gemini could solve this problem, maybe because it's very algorithmic and requires some kind of math domain knowledge to truly solve it. But when it comes to really refactoring software or architecting a distributed system, I might use Claude, because it has this amazing property where it asks a lot of questions, or in an internal state keeps gathering more knowledge from reading this file and that file, ultimately trying to make a better software decision. From a macro view it looks like convergence, but when you actually go into the details, these models are getting optimized for very different goals, it seems. Now, I do believe the ambition among these AI labs is to make generic systems that are just great across the board, but in reality what we are seeing is that companies are also starting to optimize experiences for where their user bases are. In Anthropic's case, they're clearly very focused and very intentional about solving coding. When you look at some other models, maybe they're more optimized for consumer use cases and everyday tasks, and some other labs are focused on making a general social assistant, an everyday assistant, which would have different optimizations. So that's what it feels like to me, where we are right now. Now, there is a very interesting question, I think: ultimately, when these teams figure out how to make the best coding system, how to make the best researcher for scientific domains, are they able to bring it all together back into a single world model? And I think that remains to be seen.
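
An editor's aside: the verifiable-reward setup Manu describes, where a candidate solution is scored by whether tests pass or a numeric answer matches, can be sketched in a few lines of Python. This is a toy illustration with made-up function names, not any lab's actual pipeline.

```python
import subprocess
import sys
import tempfile

def math_reward(candidate_answer: str, reference: float, tol: float = 1e-6) -> float:
    """Reward 1.0 if the model's final numeric answer matches the reference."""
    try:
        return 1.0 if abs(float(candidate_answer) - reference) <= tol else 0.0
    except ValueError:
        return 0.0  # unparseable answers score zero

def code_reward(candidate_source: str, test_source: str) -> float:
    """Reward 1.0 if the candidate code passes the unit tests, else 0.0."""
    # Run the candidate plus its tests in a fresh interpreter; a nonzero
    # exit code (e.g. a failed assertion) means zero reward.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_source + "\n" + test_source + "\n")
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```

An RL loop would then sample many candidate solutions and reinforce the ones these checks score at 1.0, which is why verifiable domains like math and coding progressed so quickly.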

Nathan Labenz: Broadly, I want to understand better what the role of human data is today. One way to think about it is: if we didn't have this human data, how would things be different? How would they fail? I'll give you a bit of a prompt. On the math side, we've seen from work like DeepSeek-R1-Zero that a model can learn highly verifiable domains like math and coding pretty well, even without the human traces typically used in supervised fine-tuning. But that comes with important downsides: the model starts to operate in strange ways, with language switching being one example of odd behavior. I think much more odd behavior would emerge if we detached ourselves from the human baseline and spun the RL centrifuge more intensively. So, one value of human data is that it provides an anchor, giving us hope for some sort of alignment by default, because we know the starting point. Another interesting juxtaposition is GPT-4.5 versus o3-mini. GPT-4.5 has excellent trivia knowledge, significantly surpassing o3-mini in its ability to answer random questions about the world. However, o3-mini is much better when it comes to reasoning challenges. So, we can decouple reasoning and broader behavior from world knowledge. Then, there's a difference again with supervised fine-tuning, where my general understanding has been that it's really important to record the trace and train the model on the actual reasoning process as exemplified by the human. I've found a ton of value in that, but in reinforcement learning, is that reasoning trace so important? So, break down the roles, flavors, and major drivers of impact of human data today, as opposed to just doing the RL thing to the max.

Manu Sharma: Expert data continues to be extremely important for all these capabilities that we are about to see and have seen in the past. I think in one way it seems like we underestimate how incredible human intelligence really is, and all the world and economy that we have built, and all the things we do. This is extremely special, and we are trying to emulate it into AI systems in many ways. AI systems are superhuman in some respects, but they still cannot do many everyday tasks yet. The way I think of it is that in domains like mathematics, we've taken that slice, saying, "Here's an interesting, awesome dataset about the truth of how logic and reasoning work in a purely scientific domain." We can really train these systems to learn from that and adjust their critique and reasoning from that source. With that, we've gotten these state-of-the-art systems today. They're very incredible, but when you put them into products as an agent, there's a baseline established now. There's a reasoning behavior that is basically a foundation for all these amazing capabilities we want to build. To build these new capabilities, let's take an example of a sophisticated AI system. In the future, you'd be able to simply call any company. Let's say you want to change your Wi-Fi router or something like that. Previously, you would call a whole bunch of people, stay online for 40 minutes, and try to resolve an issue. Very soon, we're going to be interacting with purely AI systems. These AI systems will be agents where a concierge takes your call, routes you to the right specialized expert on how to solve the problem you have with your Wi-Fi router. Maybe the system then has to interact with the entire database to provision a replacement and things like that. It's a fairly sophisticated system that interacts with you by voice, but then it has the capabilities to read databases and navigate a variety of software stacks inside that company to help you as a user. 
To build that system reliably, there's an incredible amount of edge cases that exist because we are now working with tens of millions of users and different ways people could request these things. The base model is a starting point to develop these capabilities, but the architect of that agent or solution would need a very good dataset that represents all the distribution of use cases or interactions the system would have to go through. If you were to build that system today with a base off-the-shelf model, you're going to require a lot of orchestration on top of the base models. You would need a lot of software stack and agent frameworks, and you might still not get very reliable output. That's just one very narrow example, and there's probably an infinite number of examples and use cases like this in our economy. All these things are going to require datasets for these systems to be rock solid, superhuman, and super reliable. I don't see any other way to create these datasets. You can't really invent these datasets from algorithmic synthetic approaches. You can bootstrap it; you can make the process easier with all these techniques of compute or synthetic approaches, but you'll still need quality judgment from how the authors or creators of this system want the AI system to act, how they want the customer service to be like, what personality it needs to have. These things are design parameters, and the people who are going to build these systems have a choice on how the systems need to act on behalf of the company. So all of this requires new forms of data, whether it's voice interactions, tool use, and things like that. I think there was a period, perhaps starting last year, where a number of research attempts aimed to make reasoning traces. For example,

Nathan Labenz: So, the shift in both data and training, I'll try to summarize it back to you, and tell me if I'm getting it right: it's from a supervised fine-tuning mode where the human is responsible for both the final answer and how you got there, spelling out the reasoning, thinking step by step, and so on. The model then learns to imitate that pattern of thinking in hopes that learning the pattern of thinking will get it to the right answer at a high rate of reliability. That was the previous paradigm. The new paradigm is, maybe because we've found it really difficult to collect that data, or for some other reasons, instead we're just focused on getting a really good solution. Then, we develop a rubric for evaluating the AI solutions and train with a reinforcement learning signal. Presumably, the AI takes its shot, and another model, a grader, essentially receives: here's what the AI just did, here's the gold-standard example, here's your rubric; score the new one, with the gold standard in mind, against this rubric. The reward signal given back is the sum of points on the rubric, or similar. How did I do on that? How would you elaborate?

Manu Sharma: That's right, actually. In many, perhaps most, domains, you do not know the exact right solution. But what you will know from domain experts in the respective industry is how they would determine what is good, great, or excellent. That quality judgment has to be somehow expressed in the data for the systems to learn. A lot of the work is now going towards producing these kinds of datasets. You might have heard speakers from AI labs mention this often. Some are doing interesting new work. You've talked in previous episodes about RL environments, which I think of as a gym. You get the model to go into this gym and practice a whole bunch of tasks, practice many things, and it gains a new skill. Think of any domain, and you want to model it in some form of an environment where these models can play and start honing that skill. The challenging part is crafting these RL environments to be representative of, sometimes, very generalized skills. For example, games could be more generic planning gyms, whereas coding environments or others are perhaps more specialized for certain domains and use cases. So, that's generally the way to go about it. Now, this also brings a new set of challenges. To build these systems and make this paradigm truly work, you need very reliable graders, or auto-graders. These are specialized models able to grade an AI system's progress in a way that aligns with domain experts. So, there's another work stream dedicated to ensuring that this is rock solid. Once you have that, you can really scale up these RL training environments in a post-training regime.
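
A toy sketch of the rubric-style grading summarized above: a grader sums the points for each rubric criterion a candidate answer satisfies. In practice the grader is itself a model aligned with expert judgment; the programmatic checks and the financial-summary rubric below are purely illustrative stand-ins.

```python
from typing import Callable

# (name, check(candidate, gold) -> bool, points)
Criterion = tuple[str, Callable[[str, str], bool], float]

def grade(candidate: str, gold: str, rubric: list[Criterion]) -> float:
    """Sum the points for every rubric criterion the candidate satisfies."""
    return sum(points for _, check, points in rubric if check(candidate, gold))

# A made-up rubric for a short financial-summary task.
rubric: list[Criterion] = [
    ("mentions revenue",  lambda c, g: "revenue" in c.lower(), 1.0),
    ("mentions risk",     lambda c, g: "risk" in c.lower(),    1.0),
    ("comparable length", lambda c, g: len(c) >= 0.5 * len(g), 0.5),
]
```

The summed score is what gets fed back as the RL reward signal; the hard part, as Manu notes, is making the checks faithful to real expert judgment.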

Nathan Labenz: One little aside, and maybe you'll have other examples too, but I've always wondered about super long context. That strikes me as something that would be really hard to do. You mentioned it's hard to get people to record their thoughts. I've personally experienced that quite a bit because I've advised a number of companies and coached people on,

Manu Sharma: You're right. When it comes to extremely large context windows, how do you really verify or test? In other words, how do you create datasets that these models can use to express how good they are in those long context windows?

Nathan Labenz: Mm-hmm.

Manu Sharma: There are two broad camps that I've seen. One involves programmatically modeling a needle in a haystack. You create a whole bunch of hashes and see if the model can find a particular hash among a large number of noise elements. This camp focuses on testing at a very fundamental level: can it find information in a vast ocean of data points? The other camp tests in the real world. We are actually creating many datasets that test these models' capabilities in long context. Take financial analysis, for example. It's a huge industry where one of the jobs is to synthesize a vast amount of information, like SEC filings about a company and all the information it may have released about its performance. This could include ten documents or audio transcripts of earnings calls. A top financial analyst wouldn't simply answer a question like, "Tell me a factual thing about where this is in 100 documents," which becomes a search problem. Instead, they would model the problem by saying, "Given all this information in different modalities, I want to accomplish a task: I want to analyze and create a model of the company so I can forecast its earnings potential." That's very representative of a real-world task. Someone might take days to create those things. While models are good at many of these examples, there's more room for them to accomplish those tasks reliably across a variety of domains. This forces the model to understand and reason across multiple modalities and pieces of information, what we internally call multi-hop capabilities. There's some strong performance, but there's still much to be desired. This is one way AI labs are testing and honing reasoning and planning capabilities across very large context windows. If you include video and other modalities, they consume a lot of token space because they have to be tokenized. Gemini, for instance, is very good at video understanding, identifying when something happened in specific frames.
But when it comes to real-world applications trying to mimic or integrate into actual industry domain workflows, these teams still need to improve capabilities across those factors or domains. So, there is a way to model and test these capabilities, and the devil is in the details. We can induce failures across all frontier models on these types of tasks, and that data becomes very valuable for these AI labs to then use to improve.
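
The programmatic "needle in a haystack" camp Manu mentions can be made concrete with a small harness: bury a key-value pair inside a long run of random distractor lines, then check whether a retrieval step recovers it. Real evaluations query a model over the long context; here the "retriever" is a plain string search, just to show the shape of the test.

```python
import random
import string

def make_haystack(needle_key: str, needle_value: str,
                  n_distractors: int = 1000, seed: int = 0) -> str:
    """Build a long document of random key:value noise with one needle inside."""
    rng = random.Random(seed)
    lines = [
        f"{''.join(rng.choices(string.ascii_lowercase, k=8))}: "
        f"{''.join(rng.choices(string.hexdigits, k=16))}"
        for _ in range(n_distractors)
    ]
    lines.insert(rng.randrange(len(lines)), f"{needle_key}: {needle_value}")
    return "\n".join(lines)

def retrieve(haystack: str, key: str) -> str:
    """Stand-in for the model under test: find the value stored under `key`."""
    for line in haystack.splitlines():
        if line.startswith(key + ": "):
            return line.split(": ", 1)[1]
    return ""
```

Scaling `n_distractors` until recovery fails is the fundamental-level stress test; the real-world camp replaces this synthetic noise with actual documents and multi-hop tasks.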

Nathan Labenz: Could you describe the different modalities of interaction or data collection? I'll give you a prompt from my own experience. In summer 2022, I spent the entire summer fine-tuning the text-davinci-002 series of models. They were never actually released; we had an early preview of that product. What I found to work well for me, and there was no multimodality at the time, was, for any given task, to do ten examples. At some point, I discovered the importance of the reasoning trace, but I didn't know that initially. So I would do ten, fine-tune on just those ten, and then I would have 100 queued up. I would have the model do the next 100, and then I'd basically do rejection sampling or correction, specifically emphasizing the ones it got wrong. I would fix those, put them back into the dataset, do another 100, and essentially try to get myself to an acceptable success rate through two to n rounds of iteration, depending on the task. It was initially a qualitative mode where I was totally on my own and had to do the task. Later, it became more about having the AI do the task, reviewing its outputs, finding its flaws, and correcting them. How has that evolved today?

Manu Sharma: In the context of fine-tuning these systems and models, you mean?

Nathan Labenz: My next question was about segmenting the market. Hyperscaler frontier labs clearly have one set of needs; they want to collect all available data from various experts to excel in everything. You mentioned financial services, and I also think of pharma and other specialist companies that I assume are more engaged in a fine-tuning exercise. We could add another dimension to that breakdown if it represents a fault line in data collection.

Manu Sharma: What you shared about the iterative creation of a small dataset, perhaps for evaluation, is essentially a holdout set from training data. That is generally considered best practice, and it's surprising how many teams or individuals across the industry get that wrong. Sometimes the hardest part in the enterprise space is to really think hard about the actual workflows or things they want to automate or build, and how to express that in a dataset for evaluation. It's emerging as a craft in itself to understand that trace or trajectory and express it in datasets. Generally, the best teams or successful outcomes start with something and iteratively add to that evaluation set. Like anything else, there are known knowns and unknown unknowns. More often than not, teams uncover completely new edge cases they never thought about, and you have to go back to the drawing board and say, "What if we did it this way or that way?" So, like anything else, iterative development is the cornerstone of success across the software industry. That's what makes software teams agile and successful in all these endeavors. I think that is very true in this new world where, rather than writing code, we are producing datasets and evaluation datasets for these AI systems to be evaluated and trained on. Fine-tuning is really interesting. I don't think fine-tuning as we used to see it will last in the future. What I mean by that is, in the supervised learning paradigm of four or five years ago, most teams would take a base model and then specialize it. They would take data collected from their sensors or companies and fine-tune that model to make it superhuman for that use case. Now, in 2025, fine-tuning at best helps make the model efficient for whatever task you want to perform. For example, perhaps it's very expensive to run a very large model at a high frequency, say millions of queries per hour, for a particular use case. 
You can achieve very good state-of-the-art performance on that, but given that it is a fairly narrow task, you can actually produce data from that large model and then distill it down to a much smaller, specialized model. Within those queries-per-minute or queries-per-hour operating parameters, that smaller model will simply be better across the board. It will achieve the same quality at a fraction of the cost. So, that seems like one obvious reason to fine-tune models nowadays. Then there's probably a second bucket where some problems are truly so unique that base models don't have that capability, and fine-tuning is the only way to achieve it. However, if you look at the last few years, we saw companies like Google release the Med-PaLM model, which was trained on healthcare-only data. I read somewhere that the base Gemini 2.5 models outperform a hyper-specialized model that was simply trained on healthcare data. So, what does that say? It's saying that the reasoning capabilities these base models learn from vast amounts of other data are actually useful for solving domain-specific problems. So maybe the very special datasets, the petabytes of data, that many enterprises or companies thought they had are not as useful as they believed, because they won't help improve the reasoning capabilities of the models. You need broad capabilities for models to learn from and develop their reasoning behaviors. But perhaps fine-tuning might help you achieve that particular goal more cost-effectively, for the reasons I gave earlier. We don't see as much fine-tuning across the board among our customers. What we do see is a lot of context engineering. Fine-tuning means you are truly changing the base of the model, but there are so many problems you can now solve simply through context engineering, which is emerging as a very specialized craft in itself. 
Prompt engineering emerged as a way to model a problem in a prompt, but I think the most effective systems we interact with today are actually very carefully context engineered. That not only includes the prompt, but also all the context you might give that model or query through retrieval mechanisms and things like that. That is not an easy thing to do, but it is tractable and becoming very effective.
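The distillation pattern Manu mentions, labeling a narrow task distribution with the large model and training a smaller one on the result, reduces to a very small sketch. Here `teacher`, `train_student`, and `keep` are illustrative stand-ins, not any provider's API.

```python
# Teacher-student distillation for a narrow task: the expensive teacher model
# labels the inputs offline, an optional filter drops low-quality labels, and
# a small student model is trained on the surviving synthetic pairs.

def distill(task_inputs, teacher, train_student, keep=lambda x, y: True):
    """teacher: callable large model; train_student: fits a student on (x, y) pairs."""
    pairs = [(x, teacher(x)) for x in task_inputs]      # teacher-labeled data
    pairs = [(x, y) for x, y in pairs if keep(x, y)]    # drop low-quality labels
    return train_student(pairs)                         # cheap specialized model
```

At serving time only the student handles the high-frequency queries; the teacher is consulted once, offline, to generate the training data, which is where the cost savings Manu describes come from.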

Nathan Labenz: To clarify, you're suggesting that across high-value industries like finance and pharma, base models are so proficient at reasoning that it's more effective to use them. I'm also curious about the scale. For instance, with the financial analyst use case you mentioned, we could collect data and potentially fine-tune models. OpenAI provides this on their platform, and it's also possible with open-source models. I've wondered why Anthropic or Google don't offer fine-tuning in a similar way to OpenAI. It sounds like your suggestion is that, perhaps, we create 10,000 examples of excellent financial analysis. Then, at runtime, we select the best ten or fifty examples relevant to the current task and provide them as context. This approach of using runtime examples drives performance, and you can achieve equally good or even better results than through fine-tuning and changing model weights. Is that generally correct?

Manu Sharma: Yes, many of the agents we see in the industry now require context. Take code, for example. When interacting in Cursor, its effectiveness comes from the base model receiving all the necessary context, such as coding files, relevant snippets, and similar functions. The directory structure and all that information are modeled and sent to the base model for a coder to achieve their goal. This framework can be applied to any other domain: identify the task, gather all relevant information, and provide it as context for the model. We're seeing this is very effective. As technology advances, base models will only get better, and you might not need as much verbose context. Context efficiency will improve over time, but generally, the technology arc is towards foundation models absorbing capabilities that application layers are developing through context engineering. While it may take time for base models to become universally great, if you want to accomplish a specific task or goal now, the most effective way is context engineering. There are cases where we fine-tune models ourselves, typically when we need to optimize for cost or have a very unique opinion on how we want judgments to be made. For instance, in our industry, we work with millions of domain experts worldwide—physicists, mathematicians, Olympiad-level software engineers, across 70-plus countries. One thing we do is assess these individuals' expertise. Ten years ago, teams might have tried to solve this by creating tests, but that's not scalable; it requires manual crafting. What we do now is use an AI conversational system powered by frontier models. When experts interview with us, they interview with an AI. We have their resume and research information, and they have a 30-minute conversation about their domain. An AI system then assesses how well the interview went, judging if the person will be a great producer of datasets for teaching frontier models new capabilities. 
A base model might have its own judgment characteristics, which we don't always align with or find sufficient. To improve that judgment, based on the outcomes we see from these experts in the datasets, we have to close that loop. One way we do that is by fine-tuning these models, because that specific quality judgment capability doesn't simply exist otherwise. These are examples of where fine-tuning helps and its use cases. However, by and large, most of the industry I see across the enterprise space has moved towards context engineering. We see some fine-tuning examples, but it's largely context engineering across the board. In many ways, this is beneficial because the technology now enables enterprises to intuitively think about their most valuable workflows, express them as multi-step agentic trajectories, and build agentic architectures or systems. This might involve retrieval or multiple function tool uses and calls. In many ways, it's a software engineering task with domain task mapping or engineering. This is much more tractable for vast amounts of enterprises that may not have the skillset to be machine learning experts and delve into the nitty-gritty of model training. This is one reason why AI models and the industry have taken off so quickly: most of the world is now finding ways to implement it more frictionlessly.
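Nathan's runtime-examples idea is one common concrete form of this context engineering: keep a pool of good input-output examples, and at query time retrieve the most relevant few to prepend to the prompt. A minimal sketch, assuming some embedding function `embed` and cosine similarity for ranking; both are assumptions for illustration, not anything Labelbox described.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(query, example_pool, embed, k=10):
    """example_pool: list of (input_text, gold_output); returns a few-shot prompt."""
    qv = embed(query)
    ranked = sorted(example_pool, key=lambda ex: cosine(embed(ex[0]), qv),
                    reverse=True)
    shots = "\n\n".join(f"Input: {t}\nOutput: {o}" for t, o in ranked[:k])
    return f"{shots}\n\nInput: {query}\nOutput:"
```

In practice the pool embeddings would be precomputed and stored in a vector index rather than re-embedded per query, but the selection logic is the same.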

Nathan Labenz: So, quickly on fine-tuning, since it may be a mode of attack on some of these even highly specialized problems, what models do you see people mostly choosing to fine-tune today?

Manu Sharma: I would say generally three models: Qwen, LLaMA, Mistral. Those are the three categories we see in the open-source world. I don't have much visibility into the private models, but I think it usually comes down to the nature of the business and the company. In highly specialized industries, perhaps the preference for these companies is to use open-source models because of operational constraints, either in air-gapped situations or wanting things on their own servers. However, if you go into different industries, for example, high-tech digital natives like Airbnb and Pinterest, they might actually be using state-of-the-art models from cloud providers or OpenAI.

Nathan Labenz: So, on context engineering, maybe you can coach me on a task I've been working on for a while. For every episode of the podcast, I write an intro essay, which I put at the top, and now Claude often does a very admirable first draft of it. The basic approach has been pretty much constant since Claude 2, which was not very good at it. I would do the same thing: I would take the transcript of the current podcast and a bunch of examples of previous intro essays that I originally wrote freehand. Then I just have a very simple prompt that says, "Adopting the tone, style, voice, perspective, structure, etc., of the attached example essays, write a new intro essay for the attached podcast transcript." What I've always felt a little strange about, or wondered if I could do better, is that for all those examples, I don't have the source. I don't have enough context to provide a bunch of transcripts and their resulting essays. So, it's not like I have the input and output pairs. Instead, I just have a bunch of outputs that I consider good, and then the current input. So, how would you advise me to refine my context engineering to get better first drafts? I still edit them for now, waiting for the day when Claude just nails it, but how would you suggest I improve that approach?

Manu Sharma: And where do you think the system is currently lacking? Do you think it's not producing good quality, or do you expect it to do other things that it's not doing yet?

Nathan Labenz: That's a great question. It obviously writes very nicely. It generally has no trouble following my structure. I feel like it gets me on the cadence. I have this persona, which is genuine, not an act, where I'm both very enthused about the AI's upside and legitimately fearful of it getting out of control. It usually does a pretty good job of picking up on that and finding that balance point. The thing it doesn't do super well, and this is where I have this lingering idea that if I could give it more of the inputs for those previous outputs, it might help, but I don't have the context to do that, is that it doesn't seem to, in some cases, pull out what's actually most unique and interesting about the current conversation. It gives me something that's fine, but in terms of how I really want to tee this up and frame it for people, why is this relevant now? What is most interesting about this? How does this relate to broader understanding? It's that kind of stuff where I'm often like, 'Not quite doing it.' These days I often get better results by elaborating on the prompt as well. So, in addition to adopting style, tone, voice, and so on, I'll say, "In this case, I want to emphasize," or, "I thought what was most interesting about this was..." and try to bring that to the forefront. It can follow those instructions reasonably well, but I still don't typically feel like it quite hits the heart of the matter a lot of the time. Obviously, that's a pretty subtle thing, so it's amazing that they've come this far. I'm certainly not taking it for granted. But yeah, that's maybe the best I can describe impromptu where it's not quite hitting for me.

Manu Sharma: It might be that you have to stay persistent in context engineering, asking, 'What are the examples where it didn't get it right, or it lacks certain things, or has gotchas?' Maybe there's a better way to express that in rules and such. But in a use case like this, I would argue that it is much better for you to continue with context engineering and try different base models. Perhaps try different ways to acquire new information or knowledge with tools and so forth: get the transcript, and perhaps background on the people. The reason is that every couple of weeks you will see new models and basic capabilities emerge, including, now, assistants with memory as a feature. Things that you might have liked or not liked in one instance, they will be able to capture automatically within the context. Maybe things are not great right now, but they will continue to get better as the base models improve. Whereas if you were to fine-tune something into your architecture right now, you are basically freezing an investment. You have to do it right now, and you have to keep redoing it as the base models get better and better. One thing for sure is that the base models are improving very fast across reasoning and empathetic aspects. We saw that with OpenAI's GPT-4.5 model, which is remarkable in many aspects that are not necessarily reasoning. So, maybe that is the missing element: creativity and that sheer touch of humanness. I think there's a macro question: do you really want it in this domain? This is a very creative domain. I think what makes your work so unique is that you are the editor-in-chief in this particular workflow. And maybe you do want to remain editor-in-chief, and you just want the models to be set up correctly to help you do your best work for the things that make you very unique. In these cases, perhaps the outcome is not a fully automated situation. 
But there are cases in completely different domains where it's mundane work and you just want it to be fully solved by something. In those scenarios, maybe fine-tuning might be a better technique to solve that problem.

Nathan Labenz: That's interesting. I'm not looking forward to being replaced by AI, but I at least want to know when it's happening. I do expect NotebookLM and similar tools to compete effectively with me in the not-too-distant future, if not already. It's asymmetric, of course; I recognize there's hopefully an appeal to a personality that you get to know over time and watch evolve. At the same time, NotebookLM can take any topic you want to have a podcast on at the drop of a hat, and I certainly can't match that. It will be interesting to see which relative strengths and weaknesses win out. I often think, and I have a friend who really emphasizes this point, if I just read the Claude output, I'm not sure anybody would really notice much difference or think it's any worse. So to some degree, I am really just being precious. I have a sense of what is me, and what is me matters to me. It matters because I'm reading it and putting my name on it, and I want it to be genuine to me. But that's a different question than what is quality or what serves the audience best. I can confidently say that when I edit Claude's output to make my own final thing, it is more true to me. I cannot confidently say, at this point, that it is truly better serving or better informing the audience. I haven't really run that test. I hope it is, but it's not entirely clear. That also might fall under the category of, do I really want to know? So far, I haven't run that experiment, but it would be an interesting experiment to literally sometimes just read the Claude essay and see if anyone comments that it was worse. I'm interested in comparing and contrasting frontier developers, what they're doing with data, and then everybody else. You can sub-segment that if you want. My general sense is at the frontier, they want as much data as they can get, as many domains as they can get, with the highest value and highest quality they can probably get. 
I've heard numbers of $300 an hour being paid out to physicists or biologists around the world to do this sort of stuff. When it comes to the 'everybody else' layer, which can still be very sophisticated companies, they obviously have a more narrow focus. I'm interested in what else is different. Do they, for example, primarily use their own team members to do the data work, or are they still interested in going outside? If you're a pharma company, my naive sense would be, 'I'd want my own people to do this stuff as opposed to having you source talent around the world,' but maybe that's wrong. I'm also interested in the scale. Do they have their own people do it? Do they still want to go outside? How big of a dataset do you need for something like that? How do you know? Are there rules of thumb or guidance you can give people? How much does that cost? In a way, this could be your intro sales pitch or orientation for a new customer. But yeah, I want to get the lay of the land, both at that top tier, which I think I understand better, and especially at that 'everybody else' tier.

Manu Sharma: In Frontier AI labs, we work with almost everyone at this point. There is an insatiable appetite to gain access to new, novel, specialized datasets within the realm of reinforcement learning. This helps them teach models longer-horizon tasks, doing so reliably across a variety of domains, including mathematics, coding, and other areas. That is certainly the case, and it's actually getting more in demand over time. The budgets across the Frontier AI labs for data are increasing. To give you a sense, each Frontier AI lab is probably spending over a billion dollars a year on data, so that is fundamentally increasing across the board. These Frontier AI labs are trying all sorts of different ways to produce these datasets. I think some teams try to hire contractors directly, leading to the emergence of the staffing agencies that we are seeing. All they do is simply help labs hire domain experts very quickly in bulk. Staffing agency models have become very exciting for investors in Silicon Valley nowadays because they are now active in the AI space. But more often than not, AI labs ultimately require or want really high-quality data delivered to them very quickly. To produce that data, you have to operate a data factory. Labelbox is essentially a data factory. We are fully verticalized, so we have a very vast network of domain experts, and we use excellent methods to screen and vet these experts. However, that is just one part of the story. You have to actually build tools and technology to then produce these datasets. Today, the state of the art is that most datasets are hybrid, where we have to use all of these amazing techniques with AI systems and synthetic dataset approaches and infuse them with human experts to produce those novel datasets. To your question, we are actually going to release a study in the coming weeks. We looked at our network and the earnings across the board. Guess what the yearly earnings are for a top-quartile AI trainer. 
What would you guess they are making per year?

Nathan Labenz: Am I assuming they're working a full-time 2,000 hours?

Manu Sharma: You can assume that.

Nathan Labenz: The $300 an hour that I quoted seems high, but $100 an hour doesn't seem high. So, if they're doing a full 2,000 hours, that would be $200,000. I'll discount that because they might not get that many hours. I'll put them at $140,000.

Manu Sharma: That's a good guess. Our top contributors are earning well north of $250,000 a year. There's a power law, and you see it asymptote at around $40,000 to $50,000 a year for other domains and countries. However, the best, most highly specialized individuals are earning over $250,000 a year. We actually expect it to increase as the AI frontier goes deeper into agents and business workflows. This is amazing to see. Just five years ago, the average hourly rates for datasets were much lower. We were operating in a very different regime and domain at that time. Now, you essentially have domain experts, some working part-time on these AI training tasks, and some changing their lifestyle to do this full-time, with the freedom to work whenever they want. That is the state of human data in expanding the frontier. These are extremely complex tasks, not tasks a person would do for five minutes anymore. These individuals are actually creating RL environments. The endeavor is to develop solvers and verifiers for these environments, or to play in these environments to produce an instance of what a good activity would look like. We also have a very interesting, very big business in the enterprise space. We are one of the leaders in building software tools and platforms for producing training data. We have customers like some of the biggest pharmaceutical companies, or robotics and medical imaging companies. In those cases, they are developing models where they have to work with their own experts. Medical coding is a great example. It's something I think will be solved with AI, and I'm excited to see companies doing that. In these scenarios, you must have individuals who are true experts in that domain. It is such a nuanced situation where insurance codes and medical codes, particularly in the United States, are an industry in themselves. You have to tap into those individuals from that industry or train new experts to do that task really well. 
I'm sure the companies that are starting to do really well and build these AI systems are finding all these leverage points. They understand the nuance of how it's done and are probably trying to train a larger workforce to actually do that in a factory sense. In these scenarios, these teams have to develop the entire data factory themselves, if you will. They have to build the technology and the tools, then they have to figure out who these experts will be, and how they will work in the factory to produce these datasets. Labelbox operates a large-scale data factory, primarily for frontier labs. For enterprises, what we do is offer them a technology platform to do it themselves. In those scenarios, they are using their domain experts to run the data factory themselves.

Nathan Labenz: Do you have to give them a Palantir-style forward-deployed expert to help them develop the process? I assume they need that help in most cases.

Manu Sharma: In some cases, yes. In other cases, no. It really depends on the sophistication of the customer. If you're a super high-tech digital startup, an AI startup, or an application startup, perhaps you want to have full control and architect and operate the system yourself. But in some other industries, you might get more help from our company.

Nathan Labenz: That makes sense. One other random market segment, I don't know if this really exists, but I just did an episode not long ago on the concept of sovereign AI. There are many different facets of what that might mean. One big one is countries around the world with different languages, cultures, and value systems. They might want to invest in whatever they can, and it's not always entirely clear what they should do to get models to be more fluent in their language, more familiar with their values, respond in more culturally sensitive and appropriate ways, and have more local knowledge. Whatever the dimensions, they'd probably like to improve all of them at once. I wonder if there's any business there for you. Whether there is or isn't, a theory I've had in the past is that if you are Brazil, for example, or even India, although India is big enough that they might want their own national champion, maybe what you should do by default is just make an investment in data collection, and then give that data to the frontier developers. You could say, "Hey, you can be better in Brazil. We've done the data collection work as the Brazilian government. Here's a big data dump, please use it." Then they won't have to worry about data centers, or frontier researchers, or how they're going to afford the $100 million to compete with what Zuckerberg is reportedly offering people, though I'm a little skeptical of that number, except in the most rare cases. Is there any sovereign AI demand coming your way? If you were to advise the president of Brazil, Mexico, or India, what would you tell them they should do on the data front?

Manu Sharma: I believe there is a tremendous opportunity in government to essentially leapfrog or overhaul entire ways of doing things, shifting to this new way of making it super efficient. Different countries have different systems. In some countries, the government provides certain services and owns the full stack of those services itself. In others, it's a more privatized model where they create the environment and specify, "We want private companies to be able to do X, Y, Z." In scenarios where governments are developing or have been in the business of offering services to their citizens—which I think is most governments—in some shape or form, there is a lot of opportunity to rethink what an AI-native experience would look like here. Across all dimensions, this would mean they could render those services with a much better experience, at a fraction of the cost, and move beyond their legacy systems. Finally, we have the technology to allow them to do that. Many of these governments are probably running on legacy Fortran and COBOL. I believe I learned from the DoD team that they uncovered a massive amount of things still running on legacy systems. So there are tremendous opportunities for rethinking experiences in a pure digital-native, AI-first manner. Think about India, where so many citizens could benefit from basic knowledge, understanding how things work, what amenities are available, and what services exist in different towns and villages. Now, I would argue that governments should focus on the outcomes, on the experiences they want, and then map that back to how to achieve it in the most efficient way. This could involve letting private companies develop and implement solutions, or in other areas, based on a particular use case, they might be re-architecting or developing those services themselves. In many cases, it becomes a question of how to get the three components—compute, data, and talent—together to build those capabilities. 
You certainly need all three to make it effective.

Nathan Labenz: So no shortcuts, in other words? You don't think they could just come to you and say, "Hey, we want to hand the frontier developers data on a silver platter"? Where does that break down in your mind?

Manu Sharma: There are many industries and players in our market who would say, "No, let's invest in the fundamentals, and we'll figure out the use cases later. You'll have to get your data in the right place," and things like that. I'm not so sure that actually helps these teams, or entities, achieve the outcome. Rather, let's walk backward. Finally, we have a technology that can be wielded and molded to achieve the goal. The hardest part is coming up with that intention and goal, outlining what a great experience or service would look like, and funding that. Once you have that clarity, all the other things are fairly tractable. We can help them produce the right datasets, build end-to-end systems, and evaluate those things. I think more often than not, projects fail because they didn't have that end goal in sight, rather than acquiring a whole bunch of technologies and seeing what they could do, instead of walking backward from the customer experience.

Nathan Labenz: That's a great reminder, and something I always emphasize in my ad hoc AI consulting for businesses: What problem are you trying to solve? There are so many amazing new technologies you can experiment with, get lost in, have tons of fun with, and spend a lot of money on. Yet, if you don't have a clear idea of what you are trying to accomplish, it probably won't go well for you. Sometimes people say, "We need to build a platform first, then we can build applications. We need to build our AI platform." I often say, "Let's do a spike. Let's get one thing working first. We'll learn a lot of lessons that way. We'll learn what data we do and don't have." There are all sorts of problems we can identify in the course of one spike. Probably all that gets thrown away, and then, at some point, maybe you mature into a more robust platform. But if you try to build that without specific problems in mind, you set yourself up for a lot of frustration. So, I completely agree with that analysis. I do give the frontier developers a lot of credit because I think they have taken steps to close these gaps over time on their own. One notable one was when GPT-4o dropped; the tokenization of non-Roman alphabet languages dramatically improved, which brought down cost and increased speed. It allowed more to fit in context windows. So, they've definitely prioritized this to a degree. But there's still a way in which it feels like if you're in India or Brazil, you are a second-tier user because things are so English and American-centric in many ways. If you could just do a giant data dump, maybe the frontier developers would take it, and it might lead to more parity between the American experience and the Brazilian or Indian experience of using the models off the shelf. But who knows? I don't know.

Manu Sharma: It's interesting to see how it all merges. Before Labelbox, I worked at Planet Labs, a now-public company that operates 400-plus satellites in low-Earth orbit, scanning the entire Earth every day. I was involved in developing capabilities where we would use computer vision to extract insights from this dataset. For example, the Government of Brazil would be very interested in understanding the level of deforestation they are seeing every year. There was simply no other way to truly understand that on a daily, weekly, or monthly basis. But as a company, we could develop that insight because we scanned the entire Earth daily. We would get images of all of Brazil every day at a certain resolution, apply these algorithms, and essentially feed that as an insight to the government to make the right policy decisions. That's an example where only certain companies can do that and share that insight. It wouldn't make sense for Brazil to operate its own satellite fleet just to solve that problem. So, there are categories of capabilities or use cases where you have to get the private companies that are best at what they do, bring those capabilities, and then work on integration with other systems to make whatever policy decisions or services they provide to citizens. In some cases, governments are uniquely positioned to curate or cater an intended experience to their citizens. What could be an example? Perhaps the experience with taxes could be an interesting one. Can governments offer a much more intuitive way to file taxes? Or, for countries with a centralized health service, can they make it super AI-driven, where citizens can query, understand, or get basic healthcare services conversationally? It really becomes case-by-case: what capabilities can the government invest in and completely rethink? In other areas, they will have to rely on private industry to offer these solutions. 
I think we seem to be going in a direction where there will be foundation model companies or models developed privately, or open-source versions, and they will be used as a base model in the sovereign context. They will essentially either develop applications on top of it that are unique for sovereign use cases, or in some cases, they will have to fine-tune it or develop custom models on top of it because the government has the scale and intent to produce or gather specialized datasets that companies might not be able to.

Nathan Labenz: Why has nobody offered to pay me to install software on my computer to watch me use my computer all day? It seems like one of the obvious relative deficiencies of the frontier models right now. OpenAI's Operator is certainly getting decent.

Manu Sharma: Yes.

Nathan Labenz: But you watch the AI Village, and you've got a lot of stumbling around in various nooks and crannies of UIs. It just seems like this race for data acquisition has not extended to, "Let's just go watch people use their computers, record that, and then bring that behavioral data into the fold." I've never heard an answer to this question that has satisfied me. So, what's your take on why nobody's made me that offer?

Manu Sharma: It's coming. I have clear visibility into this because we power many of the computer use agents that labs are building right now. You are seeing the first versions of truly useful computer use agents. It's really just the first inning, drawing inspiration from the movie Her, where these systems become an AI companion. They can listen to you and observe what you're seeing daily through your phone's camera, and if you're working on a computer, you could turn that on and it would know everything you're seeing on the screen. These models are very poor at that capability right now. If you were to turn on a live AI model that can understand what's on your screen, the understanding piece is good for some areas of UI or text. But when you're doing interesting or domain-specific work, they often fail, and they fail in unintuitive ways. They might fail three minutes into a session because the model didn't understand or remember what was said, or the intended goals, from the first minute. There are failures across the board, whether in understanding a single screen at a point in time (for example, most models don't really grasp shapes or geometry and have very poor spatial understanding) or across the time horizon. As part of the rollout of these AI systems, I wouldn't be surprised at all if the freemium model allowed companies to use sessions and datasets to train and improve the models. A lot of work is happening through companies like Labelbox, where we are producing these sessions and datasets at very large scale across different languages and areas of domain expertise. In a way, you could say the capabilities are being developed; they're just not rolled out yet. But as part of the product experience, companies will be able to take data from users and improve these systems, making them more reliable.
I'm sure they will offer privacy choices, and some users will intentionally allow their sessions to be used by companies to improve their models.

Nathan Labenz: Do you think I have any expertise that would allow me to be a data contributor via Labelbox? I wouldn't aim for 2,000 hours a year, but I wonder if I even qualify, because I'm a highly generalist jack-of-all-trades person. Is that profile even-

Manu Sharma: Yes.

Nathan Labenz: ...useful anymore?

Manu Sharma: Absolutely. There's a lot of talk in human data where we take examples of doctors and lawyers. I actually don't think it's that helpful. It's helpful for a new audience to understand that we're looking for domain expertise, but internally at Labelbox, we simply look for individuals with high agency and IQ across the board. That fundamentally correlates with being able to do very generalized tasks, even in specialized domains. So you would be able to learn completely new tasks, maybe in coding or in some research area, where you're able to do long-horizon tasks far better than any of these models can right now. That means you can actually produce the training data or signal for these models to learn from. So for sure, absolutely. I think I'll let you in.

Nathan Labenz: All right. I'm looking forward to experiencing the AI interview as well, because I've heard about a lot of those things getting at least experimented with. It sounds like you've scaled yours. While we're on that, are there any lessons from the AI interviewing at scale that you would highlight for people?

Manu Sharma: Yes. I don't know about the others, but we believe we have one of the largest AI interviewers rolled out in production right now. We are conducting well north of 2,000 sessions or interviews a day; right now, probably 50 people are interviewing with our Zahra AI interviewer. There are a lot of lessons. First of all, we were very surprised by how much people enjoyed interacting with an AI about their experience. When we ask our contributors, "How did the AI interview go?" they report average satisfaction scores of 4.6 or 4.7 across the board. That is because, A, they can do the interview on their own time, whenever they feel they're in the moment to have that conversation, and B, it's incredibly patient. You're able to dive into a variety of topics, and in many ways it will ask you, if you have a PhD, say, about your research paper, about a very nuanced thing that a human recruiter may not even get to in a first 30-minute session with you. People love sharing their experiences, especially if they can share the greatness of their work with that fidelity in a very condensed manner. That's one thing. Then people are using it to practice their real-world skills. We have just rolled out practice sessions of our AI interviews, where people are simply using it as a way to do interview prep for other things. We rolled it out without intending to go into that category or market, but our contributors asked for it, saying, "This is so great, I want to be able to do practice runs for my real-world jobs." I also see some really clever ways people try to cheat the system. I've certainly seen people literally put up an iPhone running ChatGPT Advanced Voice and have the AIs talk to each other. We see this human ingenuity in gaming the system, and I take that in a positive way.
We have to understand that we must build technology and systems to stay ahead of that. It means we have to be really good at assessment and at identifying solid contributors with the right intent. That is always a pursuit, because human ingenuity is so vast, and you get surprised by the ways people employ all these different tools and techniques.

Nathan Labenz: Yes. Interesting. I'm looking forward to seeing what that experience is like and how helpful the AI cheating assistant is.

Manu Sharma: Well, another thing I was very surprised about with the AI interviews is how good it is, not only at going deeper into your context, like your resume, experiences, or papers, but also at having conversations in all these different languages. It is actually able to assess how natively fluent you are in a particular language. So we are able to assess all of those things: your different language skills and your domain expert skills. I was honestly very shocked to see how-

Nathan Labenz: Are you using-

Manu Sharma: ...good and effective these are.

Nathan Labenz: ...OpenAI voice for that, or something else?

Manu Sharma: We use multiple providers behind the scenes, and we've also developed fine-tuned capabilities within our system, especially around assessing conversations and grading them. These features are all part of a data engine we've built. As we collect more interviews and receive more feedback about how effective our human experts are at producing data, we use those insights to improve the grading system.

Nathan Labenz: Yeah.

Manu Sharma: And actually, this applies to the entire format of interviews and similar areas.

Nathan Labenz: Cool. I'm looking forward to checking it out. Going back to our earlier topic for a moment, what do you expect for the future of the industry? Is this a unique situation where Zuckerberg is doing something unusual, or should we see this as the first of many partnerships between innovative developers and data factory companies? What do you predict? I understand you have a lot of experience in the field, so you may not want to speculate too much, but...

Manu Sharma: Right now, one way to look at it is that we're segmenting different aspects of human intelligence and replicating them in AI. The question is, how many more segments are left in the breadth of human knowledge? I would argue there are probably infinite segments, because we still don't fully understand how the human brain works. We haven't seen systems that operate like the human brain yet, but I think we're getting closer. With each paradigm shift, from supervised learning to now, we've consistently found that data is a critical ingredient for building highly capable models. Now, we're teaching models very sophisticated knowledge work, and while they're very capable in some ways, they're still limited at performing real-world knowledge tasks. The question is, what will it take to achieve that? In the reinforcement learning (RL) paradigm, we're exploring how to create and emulate a variety of domain-specific tasks using RL environments or datasets so these models can learn. There's a long road ahead to make AI systems more capable, integrated into our daily lives, and reliable over the long term, and all of this requires data. Exciting things are happening in synthetic data domains, and that's always going to be a focus because of the potential benefits. Still, you need human judgment to ground synthetic data, because AI has to be useful and interact with people. We're in exciting times, and I believe data will continue to be how humans supervise AI models. We want to remain in control, managing millions of AIs to complete tasks. Managing those AIs will rely on data, and Labelbox aims to keep growing as a major data infrastructure provider for both current and future use cases.

Nathan Labenz: That's a great place to end, but I have one more question. Where do you see yourself regarding the prospect of reaching an AGI plateau versus a sudden leap from AGI to superintelligence? There are arguments on all sides, but some of what you mentioned, like positive transfer or building a human-level coder that's as good as the best and can operate at scale, makes me think we could see rapid progress. You could quickly uncover new architectural innovations, and everything could accelerate. On the other hand, hearing you describe all the different segments that require data makes it seem like we might instead reach a plateau at human expert level, which might actually be safer and allow time for society to adjust. What do you expect? Do you think we might see an extended period at human-level expertise without a sudden rapid takeoff?

Manu Sharma: I think in a way both things are actually true. We are already in an accelerated takeoff by any measure, and we just adjust to it day to day: hey, why aren't all these AIs so great yet? They haven't done my work yet. But five years ago, you would have looked at this and said, "Man, this looks like some crazy capability from the very far future." So in many ways, we are in a period of fast progress right now. I think the future is going to be much more synergistic. Think about it: there are something like 30 million developers now, and I think we're probably onboarding billions of coders over the next few years with these AI systems. That's really interesting and exciting, and it has its own properties. The fundamental question, which also teases out into the realm of philosophy, is: how can these AIs ascertain quality judgment? Humans are able to make quality judgments about the world, and in many scenarios we can't express why a certain thing is so good or bad, but we know it when we see it. It's a visceral emotion that arises from the experience of reality and the human mind. These are concepts explored very heavily in books like Zen and the Art of Motorcycle Maintenance, which brings together a lot of really interesting ideas about quality from various philosophers. I think the question comes down to how that quality judgment will be imparted to AIs. So far, the way we are doing it is with humans teaching it in all these meta ways: rubrics, grading, things like that. I'm sure a year or two from now the techniques will be very different. But I think that's an open question.
I'm very curious about that, and I think a lot about it, obviously. It would be really key for what you described as even faster progress. If you could somehow magically figure out, across the vast space of human knowledge, what makes things excellent, good, or bad, that is going to be key to making these AI systems progress very quickly.

Nathan Labenz: Yeah. Taste, in maybe a word, is the domain to watch.

Manu Sharma: That's right.

Nathan Labenz: This has been excellent. Is there any other closing thought or anything we didn't touch on that you wanted to leave people with?

Manu Sharma: No, thank you for having me. It was an honor to be here.

Nathan Labenz: My pleasure. That's very kind. So thank you very much as well. Manu Sharma, founder and CEO of Labelbox, thank you for being part of The Cognitive Revolution.

Manu Sharma: Thank you.