Full-stack approach for effective AI agents

What this episode covers

There’s a lot of hype about AI agents right now, but developing robust agents isn’t yet a reality in general. Imbue is leading the way towards more robust agents by taking a full-stack approach, from hardware innovations through to user interface. In this episode, Josh, Imbue’s CTO, tells us more about their approach and some of what they have learned along the way.

Sponsors: Neo4j, Fly.io

Featuring: Josh Albrecht, Chris Benson, Daniel Whitenack

Show Notes: CARBS (Imbue’s cost-aware hyperparameter optimizer); Imbue paper on the stepwise nature of self-supervised learning; a paper on initialization/feature learning co-authored by Jamie Simon, a member of Imbue’s technical team; Imbue

TRANSCRIPT · AUTO-GENERATED

Welcome to Practical AI. If you work in artificial intelligence, aspire to, or are curious how AI-related tech is changing the world, this is the show for you. Thank you to our partners at fly.io, the home of changelog.com. Fly transforms containers into micro VMs that run on their hardware in 30 plus regions on six continents so you can launch your app near your users.

Learn more at fly.io.

Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the CEO and founder at Prediction Guard.

I'm joined as always by my co-host Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?

I'm doing very well today, Daniel. I'm hoping that we can imbue today's show with a sense of wonder and exploration.

Yes. Thankfully, we have an agent on the show with us who is going to be very helpful with that: today we have Josh Albrecht, who is CTO and co-founder at Imbue. Welcome, Josh.

Thanks. It's great to be here.

Yeah. Well, in a not very funny way, we sort of teed up a couple of things to talk about there as related to agents. At Imbue you talk about the dream of personal computing, the dream of agents doing work for us in the real world, and your approach to that. We'll dig into a lot of those things. Could you give us just a little bit of background on how you, as founders of Imbue, came to these problems around agents and accomplishing more complete or complicated tasks with agents?

Yeah. I mean, AI is definitely something that I've always been interested in and excited by.

I remember a long time ago, my friend read some book in middle school, I think, like maybe Ray Kurzweil's The Singularity Is Near. He's like, oh, wow, there's AI. Wow, so exciting. And did all that come true?

I don't know necessarily, but it seemed like an interesting thing. And I was always interested in thinking and logic and AI and neuroscience. And when I went to school, I was originally going to do cognitive neuroscience, but the professor was a little bit too boring. So I did AI research instead.

And so ever since then, I kind of, you know, I published a bunch of papers and things, but it felt like I wasn't going to have a big impact on the world. So I went off to do startups. But all the time, I was always looking back and saying, like, is now the time to get back into more fundamental AI research? Does this stuff work yet? And eventually it came to work quite well.

So yeah, stuff is working. What I've always wanted to do with AI systems is like make better tools for us. Like there's so much work that we have to do in the real world that is just not that fun, not that interesting and not really moving things forward. And so all my time at startups and the things that I've been working on, they've all been very practical, very applied versions of machine learning.

And so I've always wanted to, you know, we are an AI research company, but it's not AI research for AI research's sake, it's AI research to actually make tools that are useful. And so what we're doing at Imbue is we're trying to make tools, you know, at least to start, for ourselves. Like, can we make robust coding agents for ourselves that can really help accelerate us and help kind of take over some of the boring tasks that we don't necessarily want to do? And that's what sort of gets into agents. So agents are AI systems that are acting on your behalf.

Tools like, you know, chatbots, etc., are really cool. It's great to be able to answer questions, great to be able to generate text. But if I have to copy and paste that text every time over into some other thing and do all the work myself, it can only save me so much time. It's like a better version of Google at the end of the day, or a better version of, like, a search engine or something like that for a book.

And so I think the real promise of AI is in systems that actually take actions, but in order to get that to work, we still have a lot of work to do on the capability side. Like when you're talking about taking actions in the real world, there's a lot more risks, a lot more kind of downsides that come from that. And you need to be careful about like, you know, you don't want to empty the user's bank account. Like that's going to be a really bad product experience.

Right. So how do you make systems that the user can actually trust, systems that are robust, systems that you can know are actually correct and that flag for you, like, hey, I'm not really sure about this? So this is kind of why we always talk about coding and reasoning: we're talking about the ability to understand the outputs that are actually being created and understand, is this correct?

Is this actually going to be useful for people? And really thinking it through more like a person, instead of just, hey, here's a generation, good luck. So that's kind of how we got to agents: we want to make practical systems, and we care about making these systems actually robust and useful for people. And that's what a lot of our research is focused around.

When it comes to agents and sort of where we're at with them now (we're recording this in May of 2024, for those who are listening back), how would you kind of categorize things in your mind? Because, you know, you can download LangChain, you can create what is an agent, you know, maybe for this purpose or that purpose, that searches the web or does this thing or that thing. And there's certainly, even in my own experience, a lot of fun to be had in that, but there are a lot of challenges in making this a reality for solving problems, at least in the enterprise setting, much less in those random times in my personal life where I need to do things. So how do you categorize, as of now (of course, everything's changing), the main sets of challenges where people are hitting blockers when they're trying to create these agents?

Yeah, that's something that we actually played around with a lot last year. We interviewed a whole bunch of founders of different agent companies, both, you know, on our podcast and at our Thursday Nights in AI events, and also just in person, kind of off the record. Friends, friends of friends, people starting companies, really trying to understand: what are the problems that people are running into when they're trying to make agents? And the thing that we kept coming back to is, you know, there are all these tools like LangChain and all these other bits of infrastructure out there, all these testing things like Scorecard AI, all these different libraries.

But the problem that people really had was, what you really want to know as a software developer is: does it actually work? Does it actually answer the question correctly? And can I get these things to do what I want as a product designer or as an engineer without having to specify all of the details myself? That's sort of the promise of AI. And right now, they're really great for getting a first-pass version of the system working.

Like, oh, cool. Like you ask it a thing and like 60, 70% of the time, it's right. That's great. That's so amazing.

But 60, 70, 80% isn't really enough for deploying this, and going from that 80 to 95 to 99 to 99.99, that's actually a lot of work. And so people also have techniques, you know, like RAG or other kinds of ways of conditioning the answers to make them better and better. But the things that work today are the more constrained versions, where, you know, you're asking a very simple question or you're in a very narrow domain.

And so the programmers, the product designers can like make sure that like everything works out within these rails. Once you are in the more like general assistant category, it's like a lot harder. I think we've seen a lot less stuff be successful there. But I think in terms of categories and like in terms of kind of the problems that people are running into, I would say the main one I would summarize is robustness, like correctness, like can you actually get these things to be robust all the time?

I think that's what really distinguishes agents. Like if you think about agents in the real world, like a dog is an agent, I am an agent, a robot's an agent. Like a dog is actually extremely good at not dying for a really long time, right? It's not that 90% of the time when it walks across the road, it doesn't get hit by a car. It's like 100%, almost 100%.

Most of the time, you know, it's pretty safe. As agents, we're usually being very, very conservative, very cautious, so we take correct actions. And there's a lot of heuristics and intelligence that goes into being conservative versus being risky, but it's about being able to take a long chain of actions without something going wrong. And our agents don't have that kind of common sense and that kind of reasoning right now. I think if they did, it would make it a lot easier for people who are building agents.

As we were kind of going through the last couple of questions, talking about the problems that people are running into when they're trying to make agents work and, you know, what they can do to ensure that it has a good outcome, I also run into people all the time who I think really struggle to understand, you know, within the context of all the hype and the boom of generative AI, what can you use an agent for productively in enterprises in 2024? You know, they're used to going to these web interfaces that are becoming ubiquitous for us all, but the notion of saying, okay, I'm going back to what you said earlier on, kind of getting it out of that web interface.

Can you kind of paint a picture for people out there who are trying to bring this productively into their organization as an agent versus a web interface, how they might even conceive of how to approach problems that they might want to solve with the technology?

There's a lot more work to be done today to make agents work. I think if you approach it as a more holistic system, then it's more likely to work. So if you think like, okay, where are the places that it could go wrong?

What's the confidence that I'm getting back from the system? Can I flag that for human reviewers? Can I have a bunch of different checks in place that are in domain? For programming: does it pass a linter, or does it pass this style guide, or does it at least type check, right? Or is the syntax correct?

There's a lot of checks that you can do kind of in domain, but you can have these in other industries as well. And then you can use the LLM to score this and ask, is this particular thing wrong with it? Is that particular thing wrong with it? So as you start to build up more safeguards and guardrails around these, then you can start to get them to a level of robustness where maybe, for the easy cases, it's okay for your application for it to fail.
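The in-domain checks described here (is the syntax correct, does it pass a style rule) can be sketched as a small pipeline that collects failures and leaves the decision to accept, retry, or flag for a human to the caller. This is a minimal illustration, not Imbue's tooling: the names `CheckReport`, `syntax_check`, `line_length_check`, and `review_generated_code` are hypothetical, and the line-length rule is only a stand-in for a real linter or style guide.

```python
import ast
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CheckReport:
    """Outcome of running every check on one piece of generated code."""
    passed: bool
    failures: List[str] = field(default_factory=list)


def syntax_check(source: str) -> Optional[str]:
    """Return an error message if the code does not parse, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"syntax error on line {exc.lineno}: {exc.msg}"
    return None


def line_length_check(source: str, max_length: int = 100) -> Optional[str]:
    """Hypothetical stand-in for a style-guide rule: flag overly long lines."""
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > max_length:
            return f"line {number} is longer than {max_length} characters"
    return None


# Checks run in order; each returns None on success or a message on failure.
CHECKS: List[Callable[[str], Optional[str]]] = [syntax_check, line_length_check]


def review_generated_code(source: str) -> CheckReport:
    """Run all checks and collect every failure message."""
    failures = []
    for check in CHECKS:
        message = check(source)
        if message is not None:
            failures.append(message)
    return CheckReport(passed=not failures, failures=failures)
```

A caller could route any report with `passed == False` to a human reviewer or back to the model for another attempt; an LLM-based scoring step would be one more entry in `CHECKS`.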

And you know where that failure rate is, and you've done the work to understand how much we can tolerate. One of the things that we've done a lot internally is working on our own evaluations. This is a really critical thing for anyone who's trying to build real systems. You have to get really into the weeds of what it means for the system to be right.

We've actually taken all of the open source NLP benchmarks and made our own internal versions of them to make sure that they're not contaminated by the training data and to make sure that the questions are actually correct. So one of the things that we'll have coming out in the not too distant future, actually, is hopefully being able to contribute back some of that evaluation work that we've done cleaning up these existing benchmarks. But we also have a bunch of our own internal ones as well. And I think it's kind of critical for anyone making a system.

So make them yourself, by hand, at least like 100. Look at them: is this the right answer? Okay, what did it get? Get to a place where, as humans, you agree on this; then you're getting a machine system that is calibrated well to this; then you're checking, okay, are the things that we're getting as inputs in production from the same distribution?

Does this test actually make sense for this? Are they not drifting? If you have adversarial systems, like fraud or something, it's much, much more difficult. If you have something where you're getting the same kind of query every time, then it can be possible to get something where you can trust it enough to say, okay, cool, this is getting us 99%, that's acceptable.
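One way to put a number on "this is getting us 99%" for a small hand-built evaluation set is to attach a confidence interval to the measured accuracy. The sketch below uses the standard Wilson score interval; the function name `wilson_interval` and the example counts are illustrative assumptions, not anything described in the episode.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an accuracy estimate.

    correct / total is the observed accuracy; z = 1.96 gives roughly a
    95% interval under the usual normal approximation.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 99 of 100 hand-checked answers were right.
low, high = wilson_interval(correct=99, total=100)
print(f"observed 99.0%, plausible range {low:.1%} to {high:.1%}")
```

With only 100 examples the lower end of the interval sits well below 99%, which is one reason to keep auditing production samples over time rather than relying on a single small test set.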

We have some guardrails here, we can check how well it's doing over time. We have people looking at these and auditing some of them. That's kind of the way to make this really useful: you have to really get into the details of how do we evaluate this, what does success look like, etc.

And for the most successful use cases that you've seen out there, I don't know if you have good examples of those either internally or externally, but when you think of those, I like what you're saying about digging into the details.

I'm wondering also how much specific domain expertise is actually factoring into how you handle those details. So if you're building an agent to help people process data in a healthcare scenario, or data in a financial services scenario, or in a coding assistance scenario, there's kind of this view like, if I just download LangChain, if I go and kind of have the zero-shot approach, right, where this agent might be expected to do anything. My impression is that the most successful agent-type workloads out there so far have been very much driven by people with high degrees of domain expertise in an area who are able to work through those details. Is that impression correct? Do you have any thoughts on that?

Yeah, that seems pretty much right. I think there's this promise of AI that someday you'll be able to just ask it to do anything, and the interface sort of, you know, affords that. It looks like, oh, there's this text box, I can just ask it to do whatever and it will give me back a response, and wow, it even sounds so confident, so correct. So it seems like it's just going to do anything, and maybe it even succeeded in that case.

One example that I love from a little while ago: we were trying to see how well existing LLMs were doing at detecting bugs, and so the first thing that I looked at was, okay, is there a bug on this line? I found a function that had a bug, and it says, yes, there's a bug on this line. So, oh, wow, look, it's so good. And then it's like, wait a second.

How about this other line? It definitely does not have a bug. Oh, yeah, there's a bug on this line. This doesn't work.

It's like, wait, wait, wait, you're just always saying yes. This is not quite right. So, yeah, it seems to promise that, but you have to really dig into the details, use few-shot examples and retrieval and all these other kinds of techniques to get into the weeds, and the more domain expertise that you can bring to bear, the dramatically better I think the outcome is going to be.

So, Josh, I'm really intrigued by the statement that you put online in terms of Imbue thinking about building a robust foundation for AI agents as being a full-stack approach. I like that because it sort of reminds me, I don't know, Chris, if you remember, quite a while ago when we were still talking about data science (I guess data science is still a thing, but we were talking about it a lot more years ago).

And there was this kind of, I think it came up a few times, this discussion about being a full-stack data scientist, and oftentimes those are the most productive, where you have an understanding of how data pre-processing happens and building your application, how the model is embedded in software and deployed and all of this stuff. And so, I love that sort of thinking in that respect, and I'm wondering, from Imbue's perspective, how you think about taking a full-stack approach when it comes to agents.

Yeah, we take it, I think, to a slightly more extreme degree than most people, in that we do everything from setting up our own hardware, building our own infrastructure, pre-training, fine-tuning, RL, evaluations, data generation, cleaning, UI, user experience, like the whole thing. And at each one of these places, you can tweak some things to make the overall thing work better together.

So you can kind of change the training data that you've used in your system in order to make it more like the kind of thing that you actually need for your product. And then in RL, you can set objectives that are related to things your user actually cares about. And then on the UI, you can use the capabilities that you have to help highlight places where this particular system fails. So I think we're really interested in the full stack approach and the ability to tweak things at each one of these levels.

And for us, it comes from our history as a research company. One thing that we've always really focused on is being able to deeply understand the technologies that we're working with. And when we're doing this, you know, pre-training, fine-tuning, doing RL, it's not just a black box. We want to open these things up and understand what's actually happening inside of there.

We have a paper club. Every Friday, we're looking at the state-of-the-art stuff that's coming out, reading through this and trying to understand: what are neural networks really learning? How is this language model actually working? Where does it fail?

There are really interesting papers that show particular logic puzzles where this thing doesn't work. It's like, oh, okay, it's not really doing logic. It's not really doing addition. It's doing this other thing.

But if you tweak it in this way, like, oh, now you can get it to learn a simple form of addition that is more general. Okay, that's really interesting. What is a transformer really good at learning? What things in the data actually matter?

And how do you evaluate these things? That's another thing that we've also thought about a lot. One of the things that we set up that has been super useful is looking at not just the accuracy of our systems, but the perplexity on multiple-choice question-answering datasets. Specifically, not perplexity overall, but perplexity specifically for the multiple-choice question-answering items. This gives you a much more fine-grained understanding of, is this actually being right or not? It gives you a really precise metric for this.
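To make the difference between accuracy and perplexity on a multiple-choice item concrete, here is a minimal sketch. It assumes you already have per-token natural-log probabilities for each answer choice from some model; how those are obtained, and the exact metric Imbue uses, are not specified in the episode, so the function names and the scoring rule (pick the choice with the lowest perplexity) are assumptions.

```python
import math
from typing import Dict, List, Tuple


def perplexity(token_logprobs: List[float]) -> float:
    """exp of the negative mean natural-log probability of the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def score_question(choice_logprobs: Dict[str, List[float]], correct: str) -> Tuple[bool, float]:
    """Return (was the model right, perplexity of the correct choice).

    The prediction is the choice with the lowest perplexity, which is a
    length-normalized way of picking the most likely answer.
    """
    per_choice = {label: perplexity(lp) for label, lp in choice_logprobs.items()}
    predicted = min(per_choice, key=per_choice.get)
    return predicted == correct, per_choice[correct]


# Two hypothetical models answer the same question correctly, but the
# second is far more confident in the right answer. Accuracy is 1.0 for
# both; the perplexity of the correct choice tells them apart.
model_a = {"A": [-1.2, -1.0], "B": [-1.3, -1.1]}
model_b = {"A": [-0.1, -0.2], "B": [-2.0, -1.5]}
for name, logprobs in (("model_a", model_a), ("model_b", model_b)):
    is_right, ppl = score_question(logprobs, correct="A")
    print(name, is_right, round(ppl, 3))
```

The boolean is all that accuracy sees; the perplexity value keeps changing smoothly as a model improves, which is what makes it the finer-grained signal.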

And this idea came from a paper, which was, you know, I think something like "Are Emergent Abilities of Large Language Models a Mirage?" or something like that was the title of the paper. The point was, you know, a year or two ago, people were like, oh look, these language models have these emergent behaviors. Like, they're suddenly learning to reason or whatever. It's like, oh, wow, they're, like, so smart.

But when you really dig into it, it turns out that if you look at their performance on a log scale, it's linear. So what was really happening is just our metric was not very good, right? We weren't really asking the right questions. We weren't deeply understanding what was happening.

It was always, on a log scale, just getting better. And you just couldn't see it in the metric. And so for us, you know, this is a good example. Like, you want to deeply understand what's going on here.
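A toy calculation shows how a harsh metric can hide steady improvement. Suppose an answer is only counted as right when all of its tokens are right, and, as a simplifying assumption made for this sketch, each token is correct independently with the same probability. Then exact-match accuracy is the per-token accuracy raised to the answer length. The function name and the numbers are illustrative only.

```python
import math


def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token is right, assuming independent tokens."""
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must be in [0, 1]")
    if answer_length < 1:
        raise ValueError("answer_length must be at least 1")
    return per_token_accuracy ** answer_length


# Per-token accuracy improves in small, steady steps; exact match on a
# 10-token answer stays near zero for a long time and then shoots up.
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    em = exact_match_rate(p, answer_length=10)
    print(f"per-token {p:.2f}  exact-match {em:.4f}  log10(exact-match) {math.log10(em):.2f}")
```

On the raw exact-match column the model appears to jump from useless to capable, while the log column moves in proportion to the log of the per-token accuracy the whole way, which is the sense in which the improvement was there but not visible in the metric.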

We don't want to just treat these as magical entities; rather, they're just technologies. They're really just bags of features at the end of the day that we can use to do actual work in the real world. And so I think that's kind of our approach: take the full-stack approach, understand everything from, okay, how does the InfiniBand network work? How does that fit into our performance and optimizations?

How does the data work? How does that network work? How are all these things adding up to give us, you know, some final errors and a final user experience that's really good?

You're kind of really fascinating me with that statement.

So many people do take kind of that black box approach, and they don't necessarily have that kind of research-first orientation that you're describing as a company, as a business. How does that research orientation, where you are rejecting the black box perspective and saying we're going to open it up, we're going to tinker, we're going to understand the specifics of how small changes, you know, affect that, how does that affect how you approach this compared to whoever you would perceive as your competition? What does it mean for you as a company to take that kind of research-first approach?

Yeah, I think there are trade-offs to it. One trade-off is that, you know, it takes a little bit more time and effort to do this, to really deeply understand things rather than just hack it together and throw it out there. But I think the benefit is in the long term: when we do really deeply understand these systems, it makes it a lot easier to make modifications and to make changes and to know how to improve things. And these are very expensive to train. There's a lot of effort that goes into this, and it can be very expensive to just try a whole bunch of things, and if you don't really know what you're doing, it's easy to waste a lot of time.

And so I think for us, we would rather take a step back and say like, okay, what's actually going on here? Can we make robust systems? Can we make robust baselines? Can we get this working in a way that like we can trust our results that we can understand what's going on and build on top of those?

Another thing that we've built internally that has been really useful along these lines is CARBS: Cost-Aware Pareto-Region Bayesian Search, or something like that. Basically, it's a hyperparameter tuner that is cost aware. So we can take any system that we have and say, hey, you have all these 10 or 20 different hyperparameters, and, you know, I have this and it works, but how do I make it way better? We can take this, just throw it in there, come back the next day, and it's tried hundreds of experiments at different scales.

So it tries at a really small scale and it's like, okay, for a really small scale, this is the best way to do it. And as we go higher and higher and spend more time and resources and money on it, this is how these hyperparameters change, how things change as we scale. And there are these scaling laws, there are scaling laws for different parameters: how can you back those out and learn them for any given architecture and given problem? Having an automated system to do this allows us to quickly develop this. It does take some time to make the system, right? But it really pays off to have that kind of deep understanding of the systems that we're working with.
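The "cost-aware" part of this idea can be illustrated with the bookkeeping step alone: given finished experiments with a cost and a loss, keep only the runs that no cheaper run has already matched or beaten. This is not CARBS itself, which per its name also performs Bayesian search around that region; the `Run` type, the `pareto_front` function, and the numbers below are hypothetical and only show what a cost/performance Pareto front is.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str
    cost: float  # e.g. GPU-hours spent on the experiment
    loss: float  # final validation loss, lower is better


def pareto_front(runs: List[Run]) -> List[Run]:
    """Keep runs that achieve a strictly lower loss than every cheaper run."""
    front: List[Run] = []
    best_loss = float("inf")
    # Visit runs from cheapest to most expensive; ties on cost are
    # broken by loss so the better of two equal-cost runs comes first.
    for run in sorted(runs, key=lambda r: (r.cost, r.loss)):
        if run.loss < best_loss:
            front.append(run)
            best_loss = run.loss
    return front


# Synthetic results, invented for illustration only.
experiments = [
    Run("tiny-lr3e-3", cost=1.0, loss=3.0),
    Run("small-lr1e-3", cost=2.0, loss=2.5),
    Run("small-lr1e-2", cost=2.0, loss=2.8),
    Run("medium-lr3e-3", cost=4.0, loss=2.6),
    Run("large-lr3e-4", cost=8.0, loss=2.0),
]
for run in pareto_front(experiments):
    print(run.name, run.cost, run.loss)
```

Reading the hyperparameters along the resulting front from cheap to expensive is one simple way to see how the best settings shift with scale.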

So I think for us, it's kind of like taking a long-term view. I think in the long term, it's much better to actually understand what's going on. And it does take a little bit of upfront, you know, work. That's why, you know, we don't necessarily have a product yet; we're working on it. I think we'll get there, and I have confidence that we'll get something really cool, but it does take a little bit longer.

And that's okay. I think we'll end up with something much better as a result.

As someone who's working all the way up the stack, even up to interfaces and all of that, but you're also training these foundation models: certainly both the market and the technology and the options around foundation models have just sort of blossomed, and these have proliferated over the past year especially. What's it been like internally? You know, we've had a couple of people on the show and I find this interesting, from the perspective of someone inside a company that is training their own foundation models: how do you go about maintaining focus within the sort of environment where eventually you're going to have to spend a significant amount of time, you know, investing in a model, a specific model architecture, specific datasets, that sort of thing, but, you know, things are shifting all the time? You mentioned reading papers and trying to keep up, but how do you maintain that focus, and what's life like in the midst of being a foundation model builder in May of 2024?

Yeah, I would not necessarily characterize us as a foundation model builder, in that part of what we do is train models, but that's not the only thing that we do. And the reason that we do it is not necessarily to make, you know, the biggest, bestest foundation model ever. I think there's, you know, a lot of money going into other companies spending huge amounts on general-purpose versions of these systems.

And I think for us, the more interesting thing is can we make them more specialized ones? Can we take these? Can we adapt them? Can we make them more specialized?

Can we find ways to have them work together, to pull different things together and make a model that's better at doing that sort of synthesis, and better at the particular tasks that we care about? We've seen really good results. We'll have some blog posts in the next few weeks about this, but I think we've seen some really good results on much, much smaller models. And so, you know, if you look at DeepSeek Coder, for example, I think that model still significantly outperforms Llama models of the same size, and even much larger models.

And I think that's because it's really trained on a lot of code, and so generating code is something it's very familiar with, as opposed to being a pretty small part of its distribution. I think, again, this comes back to the fundamental understanding part. Because we know these are just bags of features, yes, having a bigger bag of features is definitely better, but then your inference time goes up as well. And if you want better bags of features, you need to give it good data.

The really important thing here is the quality of data that you're giving it, less so the absolutely massive size, I think, for practical uses. So, our focus is, can we make these really specialized and very useful for ourselves for our own purposes? I think we're pretty happy to see people out there competing, making better technologies, driving the cost of these things down, making huge context windows, giving them away for free in many cases. That's great.

We're happy to see more competition there, because I think the more interesting part is how do we actually use these things at the end of the day, put together, to be really useful.

I love that you mentioned DeepSeek. That's a favorite of ours as well at Prediction Guard, for generating SQL to do data analysis and code, and in our chat interface. Yeah, we love that.

And so, yeah, I totally agree. There's a lot that can be done with that sort of thinking. I do want to get to more of the front-end interface side. But before we get there, you mentioned pursuing fundamental laws behind deep learning in order to, again, understand and create this foundation for the agents that you're building. What have been some of the things that you've pursued in that area as the theoretical underpinnings for this progression towards robust agents?

There's a bunch of things that are still in progress that I can't speak to too directly, but we're definitely interested in, say, how do you initialize things properly, like the muP work by Greg Yang, et cetera. We have one of our researchers who collaborated with him and is working on understanding exactly what the right way is to parameterize these language models in a theoretical sense, but for a practical reason. So if, theoretically, this is the right way to parameterize them, then the practical implication is you no longer need to tune the learning rate as you scale them up. This is super helpful because that's one of the key factors.

So removing some of these hyperparameters makes it much more efficient to explore this space. So that's a very concrete, simple example of a place where a theoretical understanding can help you. Other places where this can help are not as easy to point to with an exact theory; they're sort of more informed by theory, more like physics. Physics didn't start with perfect theories of everything. People did some experiments and had a more experimental understanding of the world before we had a perfect theory about why everything worked.

I think we're at that phase with machine learning as well. So there's interesting work by one of our researchers, Jamie Simon, on what's actually happening in the fundamentals when we're learning things. There's this notion from one of his papers about learnability: a network of a given size can only learn so many things, and it's very precise. Or we've had another paper about self-supervised learning where you can see, oh, there's this sort of stepwise nature. It learns each piece of the thing.

So each of these little like theoretical things is telling you something about how they work. We don't have a full picture and the real ones are like quite complicated and a little bit more complicated than these smaller examples. But each piece is giving you like a sense for what's going on and allowing you to like operate in this space without having to like guess and check quite so much. It's not as much of a black box.

It's more like a machine where you don't know the exact internals, but you know, like don't make it too hot or it'll explode. Like don't make it too hot or it's not going to work. So you can see not just learning rate, but other sorts of precursors earlier.

Like you can look at things like norms or other quantities to understand like, is this, you know, getting too large? Is this growing large over time? Is this something that's actually too small? Like, you know, we can actually up the learning rate later. Or do we need to apply more regularization of a particular type?
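A minimal sketch of this kind of monitoring, assuming weights are available as flat lists of numbers at each logged step: it records the L2 norm over time and raises a warning when the norm is very large or has grown a lot since the start. The function names and both thresholds are hypothetical and chosen only for illustration.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def check_norm_trend(norm_history, max_norm=100.0, growth_factor=2.0):
    """Return a warning string, or None if the norms look healthy.

    norm_history: norms recorded at successive training steps.
    max_norm and growth_factor are illustrative thresholds, not tuned values.
    """
    if not norm_history:
        return None
    first, latest = norm_history[0], norm_history[-1]
    if latest > max_norm:
        return "norm too large"
    if len(norm_history) >= 2 and first > 0 and latest > growth_factor * first:
        return "norm growing over time"
    return None


# Toy weight snapshots from three training steps.
snapshots = [[0.5, 0.5], [1.0, 1.0], [3.0, 4.0]]
history = [l2_norm(w) for w in snapshots]
print(history, check_norm_trend(history))
```

A check like this does not say why the norm is growing; it only gives an earlier signal than waiting for the loss to diverge.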

Like you can kind of get a sense for these things, even if we don't have kind of perfect laws yet. And then we can see how do these parameters change with scale and understand like, you know, not just how do the learning rate and data and parameters change, but how do very specific hyperparameters change? Like what is the depth versus width that you should have? Like for this particular type of regularization, like how much exactly should you have and how is that changing?

And that goes back and kind of informs like, okay, like what is actually happening under here? Like, it's weird that this particular trend holds over scale. It seems like it needs less and less of this. Like that's kind of interesting.

Why is that? And sometimes we'll see a paper that's like, oh, that's it. I see what's going on there. That's nice.

So we're getting more and more I think collectively as a machine learning community, we're also starting to understand these things a lot more. I think when people point at neural networks or language models, it's like, oh, nobody understands. I think that's quite a mischaracterization of it. There are a lot of people that have a lot of very good ideas about how these things work.

And nobody on this call probably knows exactly how a car works, and I don't think you could make a car from scratch. I certainly couldn't. Especially modern cars, they're quite complicated. But we can use cars to go where we need, and we, like, roughly know how they work.

So it'd be weird to say like, oh, we don't know how cars work. I think machine learning networks are a lot more like that than most people kind of give them credit for.

What's up friends?

Is your code getting dragged down by JOINs and long query times? The problem might be your database. Try simplifying the complex with graphs. A graph database lets you model data the way it looks in the real world instead of forcing it into rows and columns.

Stop asking relational databases to do more than what they were made for. Graphs work well for use cases with lots of data connections like supply chain, fraud detection, real-time analytics, and generative AI. With Neo4j, you can code in your favorite programming language and against any driver. Plus it's easy to integrate into your tech stack. People are solving some of the world's biggest problems with graphs.

And now it's your turn. Visit neo4j.com slash developer to get started. Again, neo4j.com slash developer. That's n-e-o-4-j.com slash developer.

So Josh, going into the break, you had a really good analogy there about the fact that the sophistication of cars means that while we all use them all the time, we may not understand every aspect of them.

And I wanted to go back for a moment because I've been kind of percolating on some of the things that you said earlier. And you've been talking about kind of trust and robust systems and all, but I was wondering, I know in my own life, I'm very involved in the trustworthiness of models, and you talked a bit about getting good outcomes and being able to check that. Do you have any guidance on what it means to engineer trust into model training? So many organizations that I've seen kind of tack the trustworthiness of models on at the end, as in, oh yes, we have to do that too.

And with you having such an insightful and deep way of approaching engineering, rejecting the black box approach, any guidance you have on how you engineer trust in from up front so that as you get through the training lifecycle, you come out with something that you have a high degree of confidence is what you're intending it to be?

I think a lot of people are trying to do this and there's work to be done there, and you can do things to improve the models that make them more trustworthy during training. And that's great. But I think by far the largest place that we should be focusing is actually after training.

We don't trust people because like, oh, I looked at their schooling and like, you know, they seem real trustworthy after this point. Like, I'm going to give them my credit card. I'm going to give them my bank account. Like, no, you know, we're going to be like looking like, what is this person doing?

You know, okay, we're going to be checking things afterwards. Like, you know, there's a lot of other stuff that needs to happen post-training and in deployment where we can actually trust things. So I think for me, it's actually a lot more about like what is happening when you're actually using the model, like what kind of auditing or real-time verification or user interaction or other sorts of checks or things that you have. Can you have other systems that are checking the behavior of this?

Or if you have an agent, you know, maybe you'd want to predict like, is this action potentially going to have negative consequences? Or is this going to be potentially dangerous? Or will this be something that the user might not want? And those seem like good things to have totally separate systems for, ones that are completely unrelated to the development of your original model.

You would not want the original model to be responsible or connected to this at all. You want to have a totally separate thing that's looking at this, right? And so I think trust is better thought of as like a set of different types of data that can give you confidence that things are going well, have gone well, and will continue to go well. And so you can only get so much trust up front by kind of designing the system in a particular way.
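The idea of a separate system that looks at an agent's proposed actions can be sketched as below. Everything here is hypothetical: the `Action` and `Verdict` types, the rule set, and the function names are invented for illustration, and a real checker might be a different model or service rather than a lookup. The one property the sketch tries to show is that the review step never calls the model that proposed the action.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """An action proposed by an agent (hypothetical structure)."""
    name: str
    target: str


@dataclass
class Verdict:
    allowed: bool
    reason: str


# Illustrative rules only; a real checker could be another model or service.
RISKY_ACTIONS = {"delete_file", "send_payment"}


def review(action: Action) -> Verdict:
    """Independent check that never calls the model that proposed the action."""
    if action.name in RISKY_ACTIONS:
        return Verdict(False, f"{action.name} needs user confirmation")
    return Verdict(True, "no rule matched")


def execute_if_approved(action: Action) -> str:
    """Run the action only when the separate reviewer allows it."""
    verdict = review(action)
    if not verdict.allowed:
        return f"blocked: {verdict.reason}"
    return f"executed: {action.name} on {action.target}"


print(execute_if_approved(Action("read_file", "notes.txt")))
print(execute_if_approved(Action("send_payment", "invoice-17")))
```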

And then you can understand like, what is that model good at? What distribution was it trained on? Have we shifted from that distribution? Have we shifted from the task that it's good at?

How well has it done over time? Is it likely to, you know, go wrong in this new example? So I think it's more of a post-training, more of a practical kind of a problem. And the idea that we could like solve this all by making like safe or trustworthy models is a little bit, it's going to be difficult to succeed at that task.
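One simple way to ask "have we shifted from that distribution?" at deployment time is to compare a summary statistic of live inputs against the same statistic on the training data. The sketch below assumes a single numeric feature (prompt length in tokens, chosen purely as an example) and an arbitrary alert threshold. Real drift detection would usually look at many features and use a proper statistical test.

```python
import statistics


def shift_score(train_values, live_values):
    """How many training standard deviations the live mean has moved."""
    mean = statistics.mean(train_values)
    std = statistics.stdev(train_values)
    if std == 0:
        raise ValueError("training values have no spread")
    return abs(statistics.mean(live_values) - mean) / std


# Hypothetical feature: prompt length in tokens.
train_lengths = [100, 110, 90, 105, 95]
live_lengths = [300, 320, 310]

score = shift_score(train_lengths, live_lengths)
print(f"shift score: {score:.1f}")
if score > 3.0:  # illustrative threshold
    print("inputs look different from the training data")
```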

Maybe this ties into the trust element, certainly the kind of collaborative approach with agents. But you do talk also a lot about some of the thinking that you're doing around interfaces as well. And it sounds like you've also been utilizing or trying to utilize some of what you're developing internally for coding and other things. So what are you thinking about in terms of interfaces and kind of how are you dogfooding some of those things internally to kind of learn about interfaces beyond the kind of AI chat interfaces that we're all familiar with?

Yeah. So I think the learning internally from using our own kind of prototypes and internal products and demos has been quite a lot of that. Like without actually using it, it's hard to kind of get this learning about like, okay, is this trustworthy or not? Or does this actually work?

Like what UI do I want to use for this? I think when I made some prototype, it generated a bunch of code and very quickly I started to realize like, that's great. But like it's really annoying to review this much code. I see a lot of products out there that are like, oh, look, you know, we'll make a PR for you.

Yeah. I mean, how fun is it to review a PR of a few hundred lines if there's a few lines that are wrong? You have to search through for this bug. It doesn't really tell you anything about where it is.

Like this is just a really awful user experience. And so I think instead if we approach it from the perspective of like, okay, what do I want as the user here? What I want is for this to be pretty interactive and for this to tell me like, okay, maybe there is a bug here, or yeah, you asked me to make this PR, but like your ask was like kind of ambiguous and I needed to make some assumptions. Here are the assumptions I made. Like here's how confident I am.

Do you want to change them? I guess I do. Once it's more interactive, once you're going back with the user and trying to flag places of ambiguity, uncertainty, risk, et cetera, to the extent that you can be correct about those, it can make the user experience feel a lot better.

Any anecdotes from your own sort of internal experiences with these or things that you've tried either on the positive or negative side?

One thing that I really like about Copilot, just as an example, is that it keeps it short. It's easy to review. I think Copilot-style things could make these huge generations, but they normally don't, because it's kind of hard to review it and to trust it. I'm imagining that people are probably going to get to a world where they realize like, okay, this is kind of annoying.

Maybe you could point out places where there are potential bugs. Like can you just tell me what lines seem like the most suspect? So we, for example, made some internal error checkers, and some sort of highlight like, okay, yeah, this thing's not even imported. Your editor does this for you.
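A checker of roughly this kind, one that points at the suspect line instead of leaving the reviewer to hunt for it, can be sketched with Python's standard `ast` module. This is not Imbue's internal tooling; the function name is hypothetical, and the analysis is deliberately simplified (it ignores scoping and some binding forms such as `except ... as name`), so it can report false positives.

```python
import ast
import builtins


def undefined_names(source: str):
    """Return (line, name) pairs for names that are read but never bound.

    Simplified on purpose: it ignores scoping and treats any import,
    assignment, definition, or argument anywhere in the file as binding.
    """
    tree = ast.parse(source)
    bound = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
    return sorted(
        (node.lineno, node.id)
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id not in bound
    )


# Generated code that uses `math` without importing it.
generated = "import os\n\ndef size(path):\n    return math.floor(os.path.getsize(path))\n"
print(undefined_names(generated))
```

For the sample above, the checker reports `math` on line 4, which is the kind of pointer that makes a generated block quicker to review.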

You can also highlight things like, hey, this spec doesn't look like it was actually properly implemented here, or this function specification is kind of ambiguous for these edge cases. Do you want to take a look at that? A lot of the work that we've done for our evaluations is related to this as well. So when we look at evaluation data, most of the time when systems fail, it's actually from underspecification and not from, oh, the model like messing it up fundamentally.

It's more like, as a user, I didn't really decide what I wanted. But I think one thing that's really interesting to me is that coding is not really about like pure correctness in its like abstract mathematical form, or like a perfectly correct version of this. The version of the function that you want and I want are actually slightly different. Like what I want might change, you know, from moment to moment as well.

And so the user like really needs to be connected to that. And as it happens, I also, you know, learn about things from like, hmm, yes, you did exactly what I wanted, but that turned out to be not a good idea. And so I think the user needs to be there and able to learn and refine like what they even want and what's even possible in the world.

So you piqued my interest in there.

As a coder myself, who makes all sorts of errors in my code constantly, as you're doing that and you're kind of changing the workflow over time of how the coder is spending their time, and then ultimately potentially how they're thinking about coding as they adjust to the new approach that your tools are taking, how does that look for the coder going forward in terms of how does it change their day-to-day experience of coding? Are you able to rescue me from spending 90% of my time coding errors and forever trying to get myself back out of that hole?

That's really like the vision for Imbue and for the company and for the work that we're doing: can we get to a place where people, not just coders, but other non-technical people, can effectively write higher level pseudocode or intent and actually have this translated into real code, into something that actually makes your computer do what you want?

That's why when we're talking about making a new personal computer, et cetera, like we're really, at the end of the day, the thing that is missing is the ability to robustly write the software. And we can as software engineers get down in the details like, you know, we spend a lot of time fixing our own bugs, et cetera. And our goal is to make it so that as a user, you can keep working at a higher and higher level of abstraction and feel confident in that. Right now you can work at a super high level of abstraction and say, like, make this whole thing for me.

It doesn't work. And so that's not very fun because it's busted and now you like, how do you get into details, et cetera. So how can you make it robust enough so that you can work at a higher level of abstraction and trust that this part was actually correct and be able to have that dialogue back and forth and like, okay, you know, maybe it's not quite working like I want it or maybe it's not possible to do this thing or not as easy to do it in the way that I wanted to do it, et cetera. So how do you have a dialogue and help educate the person about what is possible, what isn't working, what might not be working, what is it again.

So it changes the workflow, and I think we're interested in how you change this workflow in a slightly more incremental way. You could just say, oh, we're going to have the AI system do everything for you and magically try and figure it out. But I think from our previous experience, we don't think that these types of products are nearly as good to use as a user experience. Trying to fully automate something kind of is disempowering to people and also results in a worse experience and a worse product.

So we're more interested in this interactive dialogue tool that as a person, I'm trying, maybe you can just write a line of pseudocode, you get a big block out, it tells you like one line that is potentially problematic for you to look at or maybe it just gets it right. Okay, great. You move on to the next one. Like one way that you can think about writing code is like, right, you can do it.

But there's other ways you can write it. You might also write a command, like, you know, change the file to lots of blocks, or you might also, you know, say, like, make this function more robust, or there's like lots of different ways that you can interact with this and how can you give people more tools, more like paint brushes for being able to change code and ultimately, like, make the computer do what they want. I think the thing that's really exciting about this is that when you can robustly write software, what you're really doing is being able to create agents that can do a huge swath of tasks. If you're not able to write robust software, then the only way your agent can interact with your computer is with things that we have already programmed as actions.

Like, okay, we've programmed to go to a website and click a button. That's it. But if I can write software, now I can do some huge set of things and even things that you've never intended or programmed in the first place. So for us, like, agents and writing code and reasoning are all, like, intimately connected.

I have one more tiny follow-up to that. It's a personal thing I run into all the time and, having some of your expertise, I want to throw it at you. Does it make a difference?

As most software developers, including people in the AI space doing models and stuff, you know, they write in Python, they write in usually a variety of different languages. And as I shift from one to the other, I find that some of the capabilities that are currently out there, they are great on Python because everyone on the planet is writing Python. But if I'm writing on something that's slightly more obscure, maybe even something big like Rust, it struggles to do the exact same thing that it can do flawlessly on the Python side. Do you anticipate a time where that context shifting no longer applies very well and that they're all high fidelity in terms of what they can do?

Or are we always going to be dogged a bit with the obscurity issue of certain languages?

It might go the other way. It might be that, like, because it's so much more robust in Python, we should only have to write Python. And so what we do is we just write Python and we make a Python to Rust converter.

Or make a thing that assembles Python to assembly or whatever. It might be that it's better to double down on a really small set of things that we've made tons of data for and that works really robustly, because you get a better user experience. One of the things that a lot of these models struggle with now is you have different versions of Python or Ubuntu or whatever. Things are different.

How is this supposed to know what version you're using? And so there's this combinatoric explosion of complexity that comes from all these different possibilities. An alternative way to do this would be to say, let's not do that. Let's just say you've got Ubuntu 22.04, you've got this library version, you've got that one.

If you do this, I think it might work a lot better. So it could go actually in the other direction. Instead of making it more robust on all these things, we might say, let's just all work at that level and not worry about what language it writes. Maybe we only write this high level, we never even look at it anymore.

So we don't care if it's in Rust or Python. I think once that happens, once we sort of abstract it up a level, then you might be able to come back and say, why are we writing this in Python? This is not a typed language. It's really slow.

Why don't we change it to be a language that fits better for language models? And that might be an even better future thing. But that would require generating a ton of data to make this actually work. So I see that it's probably a future thing.

Not a thing to focus on right now. But that's my guess as to how it all goes. And also an alternative world would be, it gets really cheap to just generate all this data. So we just make a converter from all of our Python pre-training data to just make it do it in JavaScript and Rust and Elixir and whatever all the time anyway.

So fine. We just, like, train on all of these. I don't know. We'll see which way it goes.

Yeah. Well, Chris will be happy if anything stays in Rust. I'm sure you're happy. We just started working on our official Rust client for Prediction Guard, Chris.

So you can be a beta user. There you go. It's been great to talk through. Again, I love this concept of this sort of full stack approach that you're taking and triggering things in my own mind to think through in my own work.

But as you look forward, either you personally or you at Imbue look forward to kind of the things that are happening this year, either in the community as a whole or at Imbue, what's kind of most exciting for you that you see as a possibility kind of coming into the future, whether that be multimodal stuff or new types of agents or products or directions that the community's going or the research is going? What kind of stands out to you about that as you look to the future?

I think the thing that is going to be most exciting over the next year or two, at least for us internally and probably for other providers externally, is I think we're going to make really good progress on what we have been talking about today, on actually reasoning, on robustness. I think once you can get to a place where you ask this question and you get back an answer that is really correct and robust and grounded, it's not just, oh, it's like, yes, it has all the right reasons. It kind of understands the nuance of like, okay, yes, ish, but there's a little bit of complexity here. You can ask follow-up questions and those are also right and robust.

That ability to robustly reason and answer questions is going to unlock some huge amount of work that I think people are not really anticipating. Once we really have the ability to robustly reason through scenarios, now we're talking about a lot more labor displacement and disruption than we were before. There's a lot of jobs that all of us can pretty easily put together, well, first I do this and I do that and I think about this. It only takes one person to do that when you have these tools that are that powerful.

I think there's going to be a lot more change in this area than people are really expecting right now. It's not to say that all jobs disappear or something, but the nature of work might change pretty dramatically and we might have much more powerful tools than I think people are anticipating.

Right, yeah. Well, we're really happy that Imbue is thinking deeply about those things as we look to the future, and in a really practical and useful way as we look forward.

So thank you for doing that. Thank you for your research and for taking time to join us. This has been great. Yeah, that's been great.

Thanks a bunch guys.

All right, that is Practical AI for this week. Subscribe now if you haven't already. Head to practicalai.fm for all the ways, and join our free Slack team where you can hang out with Daniel, Chris and the entire Changelog community. Sign up today at practicalai.fm slash community.

Thanks again to our partners at fly.io, to our beat freak in residence, Breakmaster Cylinder, and to you for listening. We appreciate you spending time with us. That's all for now. We'll talk to you next time.

Frequently Asked Questions

How long is this episode of Changelog Master Feed?

This episode is 47 minutes long.

When was this Changelog Master Feed episode published?

This episode was published on May 15, 2024.