The good, the bad, and the future of AI agents

What this episode covers

This is Hayden Field, senior AI reporter at The Verge and your Thursday episode guest host. Today, I’m talking with David Hershey, who leads the applied AI team at Anthropic. I wanted to have David on because earlier this week, Anthropic released a brand-new AI model called Claude Sonnet 4.5 that’s been making waves. So I wanted to sit down with David, who spends a lot of time testing out what models like Claude Sonnet 4.5 can and can’t do, to ask him where we are on this promise of AI agents, and also what the path forward looks like as agentic technology progresses.

Links:
Anthropic releases Claude Sonnet 4.5 in latest bid for AI agents | The Verge
ChatGPT’s built-in Buy Now button has arrived | The Verge
OpenAI really wants you to start your day with ChatGPT Pulse | The Verge
Anthropic’s Claude AI is playing Pokémon | The Verge
AI agents are science fiction not yet ready for primetime | The Verge
Agents are the future AI companies promise and need | The Verge
Amazon is betting on agents to win the AI race | Decoder

Credits: Decoder is a production of The Verge and part of the Vox Media Podcast Network. Our producers are Kate Cox and Nick Statt. Our editor is Ursa Wright. The Decoder music is by Breakmaster Cylinder. Learn more about your ad choices. Visit podcastchoices.com/adchoices

TRANSCRIPT · AUTO-GENERATED

With Fin, we built the number one AI agent for customer service. It solves up to 90% of queries for businesses, tops all the performance benchmarks on the G2 leaderboard, and it comes with a million-dollar guarantee. Check it out at fin.ai. Support for the show comes from Odoo.

Running a business is hard enough, so why make it harder with a dozen different apps that don't talk to each other? Introducing Odoo. It's the only business software you'll ever need. It's an all-in-one, fully integrated platform that makes your work easier.

CRM, accounting, inventory, e-commerce, and more. And the best part? Odoo replaces multiple expensive platforms for a fraction of the cost. That's why over thousands of businesses have made the switch.

So why not you? Try Odoo for free at odoo.com. That's odoo.com.

Hey there, and welcome to Decoder.

I'm Hayden Field, senior AI reporter at The Verge and your Thursday episode guest host. I'll be subbing in for Nilay for a couple more episodes, and I'm excited to keep diving into the good, the bad, and the questionable in the AI industry. Today I'm talking with David Hershey, who leads the applied AI team at Anthropic. David works with startups to help them figure out how to best apply Anthropic's tech, plus he tests new AI models to understand their limits.

I wanted to have David on because earlier this week, Anthropic released a brand new AI model called Claude Sonnet 4.5 that's been making waves. For reference, Claude is to Anthropic what ChatGPT is to OpenAI. This new model, Sonnet 4.5, is being billed as a big breakthrough in autonomous agentic AI, especially for coding purposes, which is a big battleground in the AI market right now. All these companies want to get a slice.

These types of AI products can, in theory, be given complex tasks and then go off and complete them over the course of many hours, or in some cases even multiple days. And Anthropic says this particular model, Sonnet 4.5, can run for up to 30 hours straight without any human intervention, all while working on a singular task like building a software application from scratch. For the last year or so, companies like Anthropic, Microsoft, OpenAI, and others have been promising that this agentic technology would be the next phase of AI, the next big hype-filled thing that comes after general-purpose chatbots. They say it could really unlock generative AI's potential, and it's true they've made some strides.

But as we've seen so far, agents aren't quite there yet, and they have a ways to go. Most of us are not, in fact, sending agents off on the internet to do our bidding, and we're certainly not giving them tasks that might take 12 or 24 or even 30-plus hours of time to work without human handling, at least not yet. At the same time, many companies are looking at agents as the breakthrough that's supposed to unlock huge productivity gains from AI models, including the opportunity to use them to replace or augment human labor. So I wanted to sit down with David, who spends a lot of time testing out what models like Claude Sonnet 4.5 can and can't do, to ask him where we are in this promise of AI agents.

I wanted to talk about what these types of products are good at from a consumer standpoint, beyond just programming purposes, and also what the path forward looks like as AI agents progress. Okay, here's Anthropic's David Hershey on the state of AI agents. Here we go. I wanted to ask you about your view of the current state of play for AI agents.

Like, we hear all the time that agents are the next big thing for generative AI, but are we still in the prototype stage, the testing phase, or what? How would you characterize what AI agents do today, right now, relative to what AI companies actually want to offer in the end, which I hear from a lot of execs is basically Jarvis from the Marvel movies?

I'm less confident in our ability to predict the end, but I'm happy to talk about it now. I've seen agents come a long way in the last year working with customers, and my view is there are places where we're starting to see what it looks like when agents work really well, and there are still a lot of places where they don't work really well, and that kind of makes it confusing.

I think for some people, code is a great example. When you're writing code, and especially with Sonnet 4.5, when you watch the model spend a lot of time as an agent developing software itself, it's incredible. They can do a ton. It's gotten much better literally this week.

As we release models, it's really visible and obvious how much better they're getting, sort of if you're plugged into that, at being an agent doing really long-running complex tasks. When you look at other sections of the economy or other jobs that people want agents to do large pieces of, sometimes there's stuff they're still not good at. In some cases, they're not good enough at deciphering what's on a computer screen to be able to navigate complex UIs or whatever it is, and so they fall over and stumble over themselves on something sort of silly, and it's easy to point at that and say, what are agents all about? It's just kind of fluff or hype or whatever it is, and I think the way that I see this sort of generally happening is we've slowly been ironing out the kinks of stuff that models fall over themselves on, and we've done that the best so far in coding.

I think the industry has done that the best so far in coding, where we're making really fast progress on how much an agent can accomplish when writing code, and we are, I think, starting to make that progress in a lot of other domains. For example, I talked about clicking on UIs on a computer. This model, Sonnet 4.5, is just way better at that. I don't know if it's exactly the point where we're going to tip into people trusting it to automate all the stuff they do when they're clicking in the browser yet, but we're getting there, and so I think my general view is we're making really fast progress.

It's just not necessarily visible in every part of the economy yet, and in every job, and every person, and every individual. I have a feeling each model that comes out will get one bit closer to it being something that everybody can sort of interact with and see. So it's great at developing software. It's great at coding.

It kind of reminds me of the robot hand moment, you know, how robots can do things that are really, really complex and hard for humans, but actually grasping something has always been a real headache. What about consumer-facing stuff? You talked a little bit about UIs, but what do you think agents right now are the absolute worst at? What's the simplest thing that they truly just cannot do?

I honestly sometimes struggle to put my finger on this because I think it's surprising little funky things in a lot of different agents. So, for example, if you're trying to do something related to finance, maybe there's like a bit of manipulating a spreadsheet that is really hard and it falls over. And so it's like 99% of the stuff it can do, it can do the math, and it kind of understands how a finance model works, but it will stumble over a spreadsheet. That's the thing we talked about.

I think it's hard to boil down all of the jobs that we want to help people do to one tiny little thing where the models aren't there. If it was that easy, I guess we would probably be on top of it already. I think maybe a better model is that for each of these different things that we want models to help us with, there's like a million little components to break stuff into.

You need to be able to see the right cell on a spreadsheet and know how a formula works and know the economic model. And thing by thing, you can find in each of these tasks a little thing that's not quite there yet. And so when we think about it, coming from Anthropic, it's like we just have to iron out and fix each one of the little gaps in each of these to help everybody use it. But honestly, this is probably dissatisfying, in that I can't quite put my finger on it.

There's just like this one thing, I think this is why this field is hard, where we're just sort of constantly working on the whole universe of the stuff people do on their computers and trying to help them out. And that just means it's a really wide scope.

I have two versions of this, but I'll give you the fun one first. I think one of the domains that surprised me the most in my personal customer work is the legal domain, where, at face value, it's apparent why that can be really useful.

There's just all this information you need to know about case law and studies, and it's just fusing information, which seems like a pretty obvious fit. But actually there's so much complexity in the legal field, which I didn't appreciate until I started working with people in the legal domain.

I used to write a lot about legal AI and how the law sector was kind of the last to adopt a lot of AI tools because they were super old fashioned.

I did a couple of pieces on that and how quickly it came. Yeah, that's exactly sort of what caught me off guard was you would think, and honestly there's so many things that make it hard. Writing a really good legal system, you typically need a lawyer to tell you if it's good or not and having to have lawyers in the feedback loop of how to build is challenging. And so I'm really impressed with a lot of companies I've worked with and their ability to sort of work around that where they have this interesting build of companies that have lawyers on staff to help with product building, like big amounts of lawyers on staff to help them build products.

They build agents and cool things so people can comb over case law and look for the right details and pieces and then obviously use the stuff that AI is obviously good at, synthesizing answers at the end. But how quickly that field, I think it's just like, it's probably like the scale of the upside there of how much just the pure volume of work that needs to be done and how much it can help that has driven the speed that people go. But that domain has surprised me. I promised you a two-part answer.

My second part of this answer is I think the funny thing about this space is it's really hard to guess where the next agent is going to take off because of that thing we talked about before where it's like sometimes there's just like one little tiny thing that's not super obvious to any of us that's blocking an agent from working. So like if you want a model to do your taxes for you, I think that'd be very nice. I don't look forward to doing my taxes every year. And you just find out that like actually what it's really bad at is you upload your W4 and it can't quite see the difference between the two boxes on your W4 and that's why the whole thing doesn't work, you know?

This is a toy example. I don't think that's actually where models are, but.

Yeah, that would be pretty complex. I remember when I worked in two states, or I had a job in one state and I lived in another.

It was also really hard for me to figure that out. So not surprising that AI can't either. Yeah, it is complicated. But like I think we can get that.

Like that's clearly, it feels in the domain of what the models can do. I've seen them be much more complicated. I know they're better than me at math by a decent amount. So I would just expect them to be able to do this.

But sometimes it's just like this one little thing. And so I'm kind of like constantly surprised and mostly it's around when we release new models. Like it turns out there's some set of things that the model can do now. And you see new agents crop up that are doing things they couldn't do before.

And so there's like nice little micro surprises that come out. I don't spend a lot of time thinking about accounting normally in a day job, but then you run into an accounting startup that can suddenly do something new and interesting. That's pretty cool. Do you think that data annotation is going to be a big part of that?

Do these models get tripped up because there's some data that they don't have enough of, especially with a niche industry? Is that kind of where you see some of the obstacles come into play? I think we need to learn from specialists. We need great ways to learn from specialists.

And I don't know if it's necessarily data in the classical sense that there's some pile of data that we're going to learn from that makes it better or there's better ways we need to incorporate the intelligence of accountants and lawyers and other people into our models. I think that can come from data. Some of it certainly will. I think that can come from talking to and interfacing and working with people in those domains.

I think a lot about learning directly from our customers. I think there's a future where more companies can contribute more directly to making the models do the stuff that they care about. It would be really nice if someone saw this big important thing that they'd like to achieve and, instead of having to sit around and wait for a lab to hopefully build a model that helps them do it, they can directly work with a lab, and I think that's probably somewhere that we'll have where we can have more direct mechanisms of working with experts.

And yeah, I don't think it's surprising that one of the things that models are great at today is software engineering when the building that I'm in, Anthropic's headquarters, is filled with software engineers who know how to make models great at software engineering because they write software all day, you know?

Yeah, exactly.

It's deeply unsurprising but that's true, and I think as we grow up as a company that's only a few years old and has spent most of our time hiring software engineers, and figure out how to consult and work with more people in more diverse places and doing more diverse jobs, we'll get better at building models that are great at all the other things that people wish they were great at.

We need to take a quick break.

We'll be right back. Support for the show comes from Odoo. Running a business is hard enough so why make it harder with a dozen different apps that don't talk to each other? Introducing Odoo.

It's the only business software you'll ever need. It's an all-in-one fully integrated platform that makes your work easier. CRM, accounting, inventory, e-commerce and more. And the best part?

Odoo replaces multiple expensive platforms for a fraction of the cost. That's why over thousands of businesses have made the switch. So why not you? Try Odoo for free at Odoo.com.

That's O-D-O-O dot com.

This week on Net Worth and Chill, I'm joined by Tank Sinatra, the meme king, with over 15 million followers across Tank's Good News, Influencers in the Wild, and his personal account. Tank is breaking down what the meme economy really is, how much a single sponsored post pays, why major brands are throwing serious money at jokes, and how meme culture, think preparation age, starter packs, and a perfectly timed screenshot, is actually reshaping how we think about money and value.

We're back with Anthropic's David Hershey discussing the landscape for AI agents.

Before the break, you heard David explaining some of the trends he's seeing, both from his work internally at Anthropic testing new models and from clients who are now using this tech in their industries. But now I want to ask David about Anthropic's big announcement this week, Claude Sonnet 4.5, and why it's being billed as such a step forward for AI agents.

You just released Claude Sonnet 4.5, and to me, that was a big deal. I'm really in the weeds on this stuff, so I was really into all the specs.

But for the average person, you know, that's just a word and a number with a decimal and another number. So why is this a big deal and how does it differ from your latest models before that? You know, what's the meaningful difference and the meaningful step change here?

Yeah, I'm very excited about Sonnet 4.5 too, and I'm also very cognizant that when my mom texts me asking about it, it just is a model with a decimal and a number after it, it's like Dr. 2.

So there are a lot of things that I think are exciting. One thing that I want to say up front before I get into a lot of details that have jumped off the page to me about the model is it's sometimes hard to predict what the big impact is going to be on everyone. These models get generally smarter.

They get generally more capable. And not to keep harping on the same point, but sometimes we just don't know the blind spots that they had until we get over them. And so we released a model last year, Sonnet 3.5, and suddenly all of these vibe coding startups happened. I don't think we actually knew that that was true.

We didn't know there was going to be like the moment we released that model. I don't think we could have predicted that so many companies would crop up and be able to help people start writing code with agents in this amazing way that we've seen happen. So part of what is exciting about Sonnet 4.5 is the unknown to me. I'm really confident this model is the smartest model we've ever created.

I've seen, and I'll get into, some of the things I've seen it do that I've never seen another model do before. And that's really cool. And part of the concrete stuff we'll talk about, I'll tell you, I'm excited to talk about some of the software engineering stuff I've seen it do. But some of the stuff we really don't know until our customers and people we work with try to build cool new things that they couldn't make work before and they suddenly make it work.

And so if I had to guess how my mom might see Sonnet 4.5 and how it might make a difference to her, it's that there's some product that she couldn't use before or that didn't exist before, in the way that Perplexity has happened in the past for consumers or Cursor has now for developers. That's what's going to impact our lives, because this model is capable of something that we didn't know before. I have a hard time predicting it. It's sort of the fun part and the hard part about working with customers: it's really hard to predict this stuff the day we launch a model, but it tends to happen, and that's cool.

Have you seen any trends start, even though it's been like one day, anyone that's starting to use it, you know, in a new way, one of your clients, or even in beta testing?

Too early for our customers, I think, to see anything brand new. It's a very fast field, but I have like a give-it-a-month rule to find out what the new companies are going to be. In my testing, I have certainly seen some new stuff, and with the testing some of the team has done, that is really exciting. One of my favorites to talk about is my team has been working on seeing how much of a software engineering task a model can take on at a time.

And so I think we're all aware that models can be used to write code. That's sort of obvious to a lot of people in this industry, and probably people on this podcast at this point, but it's often in a pair with a human really directly, like going back and forth using Claude Code or an IDE, like write one thing at a time, check and review it. And the longer you let a model go trying to implement something complex, the more likely it is to do all the stuff you didn't want to happen. And so we were really curious, like what is the most you can stretch that?

Like if I give Claude a really huge task, like what can I accomplish? And one thing we've seen with this model, in a way that is really not true of models I've tested and seen in the past, is that if you give a sufficiently good overview of what you want to accomplish, I haven't really seen a ceiling on how much it can keep consistently working on making something better. And so my favorite actual example of this, we released a video of this yesterday, but I'm going to talk about this until people get bored with me, is we asked Claude to recreate Claude.ai, like our consumer chat application, from scratch.

And Sonnet 4.5 just like worked overnight. We woke up and it just did it. Like this beautiful clone of Claude.ai that works incredibly well. And my favorite moment from it, there's a feature of ours that people like called Artifacts, where when you ask Claude to make a document or make a web page, it will make it and render it next to the chat so you can like play with an app that you built live or whatever it is.

And I was saying to the person on my team who was working on this demo, like we got this thing, it'd be really like cool if Claude could like build Artifacts itself. That would be amazing. It's like a complex feature. It's pretty hard to figure out.

It's like, we should try and see if that happens. And then he messaged me like two hours later and he didn't do anything. He didn't intervene. And he's like, hey, Claude just built Artifacts on its own.

And it's like currently testing it live. It's just like trying it out itself. And it's just like, instead of this giant team of people that works so hard on this thing, the model is just like really capable of biting off really meaty, complicated tasks. Like this is something that would take me months to do if I did not have Claude.

And overnight we sort of looked at it and watched it happen. And that progress of just like going from a point where it was neat that a model could write a snippet of code from a year ago to like, oh, it can do like a big chunk of the stuff of the complex developer work that I need to do. It's just like, it blows my mind. It was like a pretty big, wow.

So that took about 12 hours, you said?

That specific one was like a 12 hour thing. We've seen like up to 30 hours of continuous dev work.

Okay, this is what I was going to ask you about because something that unexpectedly went viral from my own article about Sonnet 4.5 was the 30 hour bit.

The fact that Sonnet 4.5 could code autonomously for up to 30 hours with no interruption. So I heard one engineer at Anthropic used it to code a chat app that the company compared, when they talked to me, to Slack or Teams. Obviously it was only 11,000 lines of code, so much smaller than Slack or Teams. And it seemed like just an example project, but people online are really excited about that detail and calling for Anthropic to release it.

They want to know more about it. So can you give me any details about that on your team that was testing it and how impressive was it really? Or was it just kind of rudimentary? Give us the details.

Yeah, yeah, yeah. This was the thing that also, I don't know, all of my excitement and reason I'm here is because I've been working on this exact thing. So it was someone on my team. His name is Justin.

Shout out to Justin. You should give praise to him. He's amazing. Recently joined the team, as a side note.

Nice.

This was born out of this thing of, a lot of people have built demos or proofs of concept, and there's this sort of vibe coding trope that, yeah, you can quickly mock something up, but can you really build a real application? Justin really wanted to test that. And so he was experimenting using our Claude Agent SDK, which is just sort of like a more programmatic version of Claude Code to some extent.

He was testing, like, can I give a full stack of like a complex thing, like an app similar to Slack to the model and just like watch it build. And we did some tinkering experimentation around to get it right. But yeah, you asked like, what's impressive? It's impressive.

It has like DMs and threads and channels and slick search functionality, and you can upload images and GIFs and render them, and like multi-user authentication. And we didn't ask it to, but it implemented a whole bunch of like AI users for testing. So if you like log in, you can send a message and there's like Alice, the PM, in there that you can send a message to who will respond back with stuff about PM work. It is, it is remarkable.

It is like, it's not by any means Slack, but like if you didn't spend a lot of time thinking about it, you'd like look at it and think that was a pretty reasonable productivity app that you would use to chat with your coworkers. Wow, what did you guys name it? I don't think we have a name. I need to ask Justin to give it a name.

Actually, no, I like, he has a work chat, very boring. Work chat. Clearly not the product folks in the org. Thanks though, I love that.

Yeah, someone who worked at Slack, I think, messaged me and said, you know, our code base is a thousand times that. And I was like, yeah, it's just an example, but I mean, it happened in 30 hours, so it's a big deal. I was also going to ask you what surprised you the most during your own testing of it, of Sonnet 4.5.

I'm just going to continue double clicking on this thing.

The way that it built the app was surprising and really interesting. I'm not surprised that the model's going to be better at building complex apps. The thing that was really interesting for the way it does this is there's a tendency to like bite off little pieces that it can handle really well and do that continuously. So a lot of times like models in the past would get really eager and ambitious.

They would say like, I'm going to build this whole thing and I have like these grand ambitions, and it would kind of just like meander everywhere trying to do this miraculous piece of work. And the thing that was really cool about Sonnet 4.5 is it's just like pragmatic. It's like, okay, right now I'm going to test like does image upload work, and then it's going to do that. It's going to spend a little while doing that.

It just like bites off one little chunk at a time, and that feels a lot more like how I want a coworker or collaborator to work. Like if I ask you like, hey, I need you to go build work chat, I don't want you to go off on this like crazy escapade trying to make everything magical. I want you to just like bite off a piece at a time, show it to me, like commit it to Git, that kind of thing.

And it's just a little bit more natural and collaborating with it feels more natural. And funny enough, I think this is unrelated, but we've been chatting in our internal company Slack with Claude a lot lately and just like chatting with it and it's been really natural. Like it feels just a little bit more like working with a co-worker does. In terms of the tone?

Yeah, like the tone, how it responds, like how it acts in Slack, how it participates in the conversation, what it tries to do. It's a little less like over the top and eager. That has jumped out a little bit, which is surprising. That's not something that I normally expect, but it's also been funny.

I guess it's just funny. Like it cracks better jokes. It's a little bit more witty, that kind of thing. Do you think it also dialed back like that?

I think it just gets in the way of doing good work and it's kind of a big focus as far as like, I think nobody likes that and I also think it's bad for them and it's just bad. So I do think we have made meaningful progress and this model seems to be a little bit more willing to push back, and that's part of that thing, like being a natural co-worker is being someone who actually can tell you when you're wrong.

When it comes to rebuilding Claude.ai, that's pretty big. Did that worry you at all?

Because, you know, it's kind of doing, like you said, you know, organic work that you worked on for months. Does that worry you about job replacement for engineers, anything like that? What were your thoughts?

We have a great team, and I'm really not worried. I'll put it in two ways.

I'm not currently worried. Right now, Claude is a collaborator. Like, it works really well with me. It accelerates me.

I think it makes our whole team better and faster in writing software. In general, like, this is not a now thing. To be really honest, like, watching Claude go for 30 hours, it does create a little like, oh my God, it's a pretty different thing. It is a meaningful step change. Anthropic is founded on the principle that this technology is going to be hugely impactful in the world, and part of that is it's going to change how we do jobs. Something like doing a whole week of work for me, that's going to change the industry of software engineering. And so, yeah, it'd be goofy for me to say there's not any amount of, like, how are we going to work next?

Like, how do I incorporate this? What is it? My net net, though, is like, I just think there's a ton of room to make us better and make better software for users and to make the world a better place with this technology, and I'm really confident in that still. But there's like a smidge of, like, we're going to have to figure out some new ways.

Brought to you by Nespresso. Hear that?

That's your next obsession. Every coffee, a new world. Every sip, a new taste. This is a new Nespresso.

One touch, endless possibilities. Ice, flavored, long, short because some days call for the espresso kick and sometimes a smooth silky latte just wins. It's exceptional but effortless, like actually effortless. Simply press, brew, and explore.

Nespresso, what else? Keep exploring at Nespresso.com.

We're back with Anthropic's applied AI lead, David Hershey. Before the break, we were talking about the capabilities of Claude Sonnet 4.5 and whether he sees a future where this technology might even automate parts of his job.

But now, I want to ask David where he sees the model falling short and what might be next for AI agents.

Well, what are the primary limitations of Sonnet 4.5 that you wish you guys could have addressed with this release but couldn't? And what was something that you tried to make it do during testing that it couldn't do? Basically, the features or the context window or anything else that you wish you could offer that you couldn't, and also what was dumb about it in testing.

I have a pet thing that I always test on, which is I really like to make Claude play games. I'm accidentally famous for creating Claude Plays Pokémon in the past.

Oh, that was you. I forgot about that.

Yes, that is my project. And I have also tried a lot of other games. It's just one of those examples that exist in the world: it can do PhD-level math, and I can't, and for it to not really understand that it can't walk straight through a building hurts my brain. Models are kind of weird and funny that way. The one that stood out the most to me is I really wanted Claude to be this great chess player that's so logical and interesting, and then it doesn't know where the pieces are on the board and where other pieces are, and it's like, ah, you're almost there, Claude. One day.

And features, I think, are still out there. I have been so in on this model that I have to think about the next level. Honestly, the real answer is I don't know. I can't think of some feature that I wish to say.

There's a lot that I want Claude to get better at. There's so much that I want Claude to get better at. I want it to be able to beat Pokémon one day on its own, and I want to see how it does it. Again, there's all these fields where I know that Claude still needs to get better.

I always talk about the goal of Claude becoming a better lawyer, but I don't think it's as good a lawyer as it is a software engineer yet, and I still wish we worked on that. And we're making progress. I know the teams that are working on all these little things and focusing and thinking about all these little things, and there's always so many hills in my mind that we have to climb still. So I could probably, people happy here.

It's a fun experiment. I don't think it has crested the peak of what we need to train Claude to be good at yet, unfortunately. One day.

Well, let's talk a little bit about what's next. So talk to me about why you think Anthropic is pursuing such gains on the AI coding front and how it stacks up to competition.

Obviously in testing, you had to take into account what the market looks like right now and what other models can do. How did you think Sonnet 4.5 stacked up during testing, and what stuck out to you there?

The coding market is really important for a lot of reasons. It's really well positioned for people to build with our models.

It's a place where we can have huge impact. People have figured out great ways to integrate models into how they do the job of writing code, I think more so than in essentially any other industry, and so there's a right now where you can make a huge difference by continuing to make models better. I think coding is probably the best use case, and so it's a huge focus for us. We want to keep helping people who are relying on our models to write code get more out of them, and our belief, we've talked to a lot of customers with a lot of testing.

I'm pretty confident that this is the best coding model in the world. We have a lot of benchmarks and other things to prove that out, and it certainly has been, for all of the people here in our early testing, just a huge step-function improvement in what it feels like within Claude Code or other surface areas to develop and write code. From my perspective this is a really big change. And personally, this is just talking for myself, this is the most noticeable change and improvement I've seen since we released Sonnet 3.5 last year, which I think is a funny coincidence. I don't think we necessarily knew that these .5 Sonnets were going to be such special models for us. But at the intersection of vibes and numbers and things I've seen, this feels like a really, really meaningful step-change improvement.

One of my favorites, just to maybe call it out: I love the team at Cognition. I spent some time working with them, and they put out a post yesterday about how much their product has improved with the model and some of the work they did to make that product better. I thought that was a cool one. That validation from the customer, it's a huge jump in a benchmark, and I really haven't seen a benchmark jump for them in a while. I think that's really cool. For the right product and the right way around this model, I have a feeling this is a huge step-function change in the ability to write code, and I'm quite confident it's going to be the best model in the world for people like that.

Yeah, that's really interesting, because I usually don't mention benchmarks much in my own reporting. Sometimes they can be subjective, and sometimes they're, you know, created by the companies that are testing their own models in very specific areas with very specific sets of questions. But usually what I hear from engineers is that they go based on the feeling and based on what things it can do that it couldn't do before in their own anecdotal testing. So that's why it's interesting to see that you have your own pet projects that you like to test on, and you've seen changes in that regard instead of just in the benchmarks specifically.

I also wanted to ask you about Anthropic chasing consumers versus enterprise versus governments. All AI companies right now, it seems like, are kind of working with those three tiers, and I think it's probably because two of them maybe are more concrete areas for potential profit. So with Sonnet 4.5 and all the stuff you're working on in general, and I know you work mostly with the enterprise side of things, is there a specific slice that Anthropic is chasing more right now, and why?

I think they're basically all really important. One of the things that we have luckily seen, and I think it's true for all of the labs, is that when we make our models generally smarter, they service all of those segments. They service enterprises who are building with us, they service consumers who want to use our chat app or write code or get into that, and they're useful for the public sector too. Just to tie a little bit of a link here, I work with customers and startups that have become big consumer hits. Lovable is a great example. That's an enterprise customer from my perspective, one that I spend much time working with and helping to be successful, and our team does too, but then they go turn around this thing that everybody can use to build just-in-time apps to serve every part of your life. So I think this kind of just feeds back on itself, where in reality our focus is building really great models that are safe, and we have seen time and again that that results in success in all these places.

And again, this is talking a little bit from my personal perspective instead of just Anthropic's, but there's a lot of ways that you can make progress with great models. Whether it's helping enterprises or helping consumers or helping the government, these all have a way of working back in on themselves, where the thing we do that really makes a difference is make great models, and there are a lot of ways you can impact different segments with that.

And when it comes to consumer use cases, OpenAI seems to be pushing pretty hard into that right now, at least this week. They launched Pulse recently, last week, and then yesterday they debuted their instant buy button. I wanted to ask where you think Anthropic is planning to meet consumers. Do you think it's more likely you'll be reaching customers with Claude directly in the future, or is it more likely customers will use Claude through something like Cursor? And that can also apply to your startup clients too, right? Where are you seeing people really find Claude?

I definitely would think we are growing the presence of the applications we build to interface with consumers. Yeah, and Claude Code is a great example of this. I know it's not the traditional consumer, but there's a very prosumer-y thing that we've captured with Claude Code, and I think we've demonstrated that we do have some of the muscle to capture, for some sets of clients, a thing that they love, that sparks the excitement of a consumer product that people love using. Obviously we would love to do that 50 times over, and we would love to be able to invent a bajillion beautiful uses of Claude that consumers love and love to build. We're investing and trying to find and build great experiences that help people do important things with Claude. We have this constant increment, we have this Imagine demo that came out, just a limited-time thing. I don't think that's the product we're going to launch that is consumer facing, but it's part of a portfolio, I guess, of how we think about it: we need to keep building and trying and innovating and inventing products, seeing if there's something there. And we have a good opinion of what models are capable of, which helps us have an interesting mindset as we build. So it's a big focus. We need to get there. It's important for us to build direct relationships and help people have awesome experiences with Claude.

That said, I don't know, I'm biased as a customer guy a little bit, but I also just think I wouldn't ever, and I don't think we should, and I don't think Anthropic is betting against the ecosystem of people who are trying to build amazing products. I think it'd be really silly to claim that that's all our market to grab. I just don't think it is. If we build great models, then there's an incredible ecosystem here in Silicon Valley and abroad building these products that I think is probably bigger upside than our first-party applications will ever be. So I think that will always be a huge focus, of course.

Awesome. And then last question for you, back to the AI coding market and how important it is and how much most AI labs are chasing that right now. We talked a little bit about vibe coding earlier, and here at The Verge a lot of us have tried it, and to no avail. We had some success with really simple stuff but not a lot of success with building large-scale applications. A couple of us did, but not as much as we actually expected. Did you see a big change in terms of testing out vibe coding with Sonnet 4.5, or was it the same as before?

I have noticed it in my own personal coding with every model we release. I am trained as a software engineer, so it's like cheating. I'm like, not the best, getting paid for vibe coding, but I still do it on the side sometimes, and you notice meaningful improvements in how much it can turn out before it goes off the rails and how much you can trust it. I actually think this is an interface problem. There's something funny about AI, which is that it has a tendency to outgrow interfaces really fast. If you look at the history of coding with AI, there was Copilot with ghost text, so Copilot would automatically complete your code. For a while, you would go to us or ChatGPT or wherever it was and ask it to write some code for you in a browser window, if you were a developer, and then you'd copy and paste that into your editor. Cursor figured out how to bridge those two things, where it could be side by side. A lot of people started building this sort of agent that was alongside your IDE to let you build things.

I don't think any of that is quite the thing that we need for everybody to build production applications, though. There's some interface where I actually do think Sonnet 4.5 is the first model that could be that thing, where anybody could build a production-ready application. I've seen enough evidence of how, left to its own devices, it can build complex applications. I've seen it be able to deploy a complex application to AWS fully autonomously and do a security audit on it. Those are both really incredible things that make me think this is the model that can cross that chasm and get to the point where anybody can make something production ready. I have a feeling that we need one more interface that isn't Claude Code and isn't Cursor. The next step past that, I think, needs to happen to make it more obvious to everybody, instead of having to try to figure out if you're on the right path vibe coding.

Awesome. Well, thanks so much. I'm so glad we got to talk, and I appreciate you coming on so last minute. I really appreciate it.

Yeah, it was fun. I appreciate it. Of course. It's really nice to be on.

I'd like to thank David for taking the time to speak with me, and thank you for tuning in. I hope you enjoyed this episode. If you'd like to let us know what you thought about this show or what else you'd like us to cover, drop us a line. You can email us at decoder@theverge.com. We really do read every email. Or hit me up directly on X, Bluesky, or Threads. I'm Hayden Field on all platforms. Decoder also has a TikTok and an Instagram, and now also a YouTube channel. Check those out at @decoderpod. They're a blast. If you like Decoder, please share it with your friends and subscribe wherever you get your podcasts. Decoder is a production of The Verge and is part of the Vox Media Podcast Network. Our producers are Kate Cox and Nick Statt. Our editor is Ursa Wright. The Decoder music is by Breakmaster Cylinder. See you next time.

Frequently Asked Questions

How long is this episode of Decoder with Nilay Patel?

This episode is 46 minutes long.

When was this Decoder with Nilay Patel episode published?

This episode was published on October 2, 2025.
