MLOps and tracking experiments with Allegro AI

What this episode covers

DevOps for deep learning is well… different. You need to track both data and code, and you need to run multiple different versions of your code for long periods of time on accelerated hardware. Allegro AI is helping data scientists manage these workflows with their open source MLOps solution called Trains. Nir Bar-Lev, Allegro’s CEO, joins us to discuss their approach to MLOps and how to make deep learning development more robust.

Sponsors:
DigitalOcean – DigitalOcean’s developer cloud makes it simple to launch in the cloud and scale up as you grow. They have an intuitive control panel, predictable pricing, team accounts, worldwide availability with a 99.99% uptime SLA, and 24/7/365 world-class support to back that up. Get your $100 credit at do.co/changelog.
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.
Rollbar – We move fast and fix things because of Rollbar. Resolve errors in minutes. Deploy with confidence. Learn more at rollbar.com/changelog.

Featuring:
Nir Bar-Lev – LinkedIn
Chris Benson – Website, GitHub, LinkedIn, X
Daniel Whitenack – Website, GitHub, X

Show Notes:
Allegro AI
The “Trains” Platform
Trains demo server
Trains video tutorials on YouTube

Upcoming Events: Register for upcoming webinars here!


TRANSCRIPT · AUTO-GENERATED

When we talk about MLOps, we talk about the ability to move from the data scientist's own machine or laptop to training models at scale on some remote machine cluster. We're talking about the ability to orchestrate that and do that within a larger team, not just a single data scientist. And we're talking about the ability to automate that process. That, in general, is what we talk about when we talk about MLOps.

How is it different than DevOps? Well, it's actually very different.

Bandwidth for ChangeLog is provided by Fastly. Learn more at Fastly.com.

We move fast and fix things here at ChangeLog because of Rollbar. Check them out at Rollbar.com. And we're hosted on Linode cloud servers. Head to linode.com slash ChangeLog.

This episode is brought to you by DigitalOcean, Droplets, Managed Kubernetes, Managed Databases, Spaces, Object Storage, Volume Block Storage, Advanced Networking with Virtual Clouds and Cloud Firewalls, Developer Tooling with a robust API and CLI to make sure you can interact with your infrastructure the way you want to. DigitalOcean is designed for developers and built for businesses. Join over 150,000 businesses to develop, manage, and scale their applications with DigitalOcean. Head to do.co slash ChangeLog to get started with a $100 credit.

Again, do.co slash ChangeLog.

Welcome to Practical AI, a weekly podcast that makes artificial intelligence practical, productive, and accessible to everyone. This is where conversations around AI, machine learning, and data science happen. Join the community and Slack with us around various topics of the show at ChangeLog.com slash community and follow us on Twitter. We're at Practical AI FM.

Welcome to another episode of Practical AI. This is Daniel Whitenack. I'm a data scientist with SIL International. And I'm joined, as always, by my co-host, Chris Benson, who is a principal AI strategist at Lockheed Martin.

How are you doing, Chris? I am doing okay. It's summertime here in Georgia, and so it is hot and humid, and so I'm just trying to keep from melting. Yeah, not unexpected for where you are.

Happens on a regular basis. Hot, humid, not the melting part. Yeah, so I think I mentioned this a couple times on the podcast, with my wife's candle business. So there's always this, like, during the summer, you've got to figure out the right shipping and tracking so that you kind of minimize the likelihood of candles melting on people's porch before they actually get into their house if you're sending them to, like, Texas or Arizona or that sort of thing.

That's a good point. Yeah, so it's an interesting thing. On another, you know, shipping front, I've got a pile of boxes sitting next to me and a computer case. All the components for a computer are here at my house.

I'm about to build a first AI workstation of my very own, so I'm excited about that. Very nice. It would be fun to have an episode, you know, detailing all of the mishaps that happen along the way as I hopefully don't ruin it but get this thing running, I'm sure. So I'm curious, since you brought it up, I know in the past when we've talked, we both have typically gone to cloud services, especially for personal things that we're doing at home for our own interests.

What caused you to decide to go this way this time with a desktop? I think it was twofold. I think partly it was, like, I haven't built a computer since I was in college probably, which, I don't know, that's over 13 or 14 years probably, 15 years since I built a computer maybe. I thought it would be fun to just do it again, so that's partly just fun.

But then also I'm getting into a lot more of audio models, so speech recognition things and spoken language identification, and the data sets associated with those are quite large. And so sort of carting those around to various cloud machines and also running models for, you know, maybe days instead of hours starts to get fairly expensive, so I think those two things made sense to me. Well, good luck with it. Yeah, we'll definitely have to get an update from you to share with us all what happened and what went wrong and what went well.

For sure. And today we're going to keep the practical train moving with some more topics that are extremely practical. Actually, I had seen what we're going to be talking about today, which is some tools from a company called Allegro AI. One of my friends pointed me to that, which I'll mention maybe a little bit later on.

But I also saw PyTorch mention recently that Allegro Trains, which is one of the MLOps and experiment managing and versioning things that we're going to be talking about today, joined the PyTorch ecosystem project. And I thought that sounded really exciting, also very practical. So today we've got with us Nir Bar-Lev, who is the CEO and co-founder of Allegro AI. Welcome, Nir.

Thank you for having me, guys. Yeah. Before we jump into all of those exciting things about experiment tracking and versioning and MLOps and all of that, it'd be great to hear just a little bit about your background and how you got involved in this field. Sure.

And I've been in the high tech industry for longer than I care to actually say, probably three decades. I started as an engineer, actually, and spent about a decade on large ERP systems and that kind of thing. This was way back. And then, by way of an MBA at Wharton, I joined Google and spent a decade at Google, doing everything from working on the mobile team.

This was right after Google had bought Android and before the iPhone came out, and actually helping set up Google's Tel Aviv R&D Center, to leading Google's European search advertising strategy and a number of other roles. And after all that, I was a GM of mobile payments. And yeah, when I decided to look for something else to do, I joined two folks who are actually my partners now to basically start Allegro AI. The way I came at it is I was looking to do something big that can impact the world and that would involve cutting edge technology.

After being at Google and doing everything I did, you don't want to do anything less than that, really. Yeah, I was going to say, being at Google, if you're thinking of projects that make an impact, this sort of worldwide impact or innovative, it seems like that sets a pretty good trend for your path or a high bar to reach, for sure. That's correct. I don't know if I'll be able to build a company as large as Google, but certainly that's the target you want to put for yourself.

Yeah. Yeah, and it is interesting. I mean, it seems, and I don't know if you have a perspective on this as a CEO and founder, but it seems like there are a number of really innovative startups in the AI space that are kind of playing at the same level as the major players of, you know, OpenAI and Google and Microsoft and these. At the same level at major research conferences, you see kind of startups, I'm thinking of like Hugging Face or those sorts of startups that are really right there.

And it seems like such a huge impact for a small team. And so, I don't know, as a CEO, if you think about those things, but it seems really interesting to me that there can be these small, really focused teams that make a very large impact on that level. Yeah. You know, being at Google, you kind of think that you can probably do anything, right?

The reality is that, and I've seen that personally on some of the big projects that I was involved with, is that, you know, Google didn't execute as well as a startup or as fast. Google ended up acquiring them for less or even a lot of money, right? And there are a number of examples for that. And as a company grows larger, the targets get bigger, right?

And so doing anything requires a very, very high bar. I remember at some point, I'm talking about like 2007, I think. I remember pitching something to Susan Wojcicki, who is now the CEO of YouTube. At the time, she was the head of advertising.

And the bar was, you know, if it's not about $100 million revenue, don't talk to me about it, right? And you can imagine this is back in 2007. So imagine today. And this gives an opportunity for small companies who are very nimble to identify opportunities.

There's also a different perspective when you're outside of Google than when you're in. Especially, I think, in the B2B space, there are opportunities where, you know, at least Google specifically is still relatively behind companies such as Amazon, for example, or Microsoft, right? So you can identify, as a small company, niches, and if you understand that those are going to grow, then that's an opportunity. So kind of curious, as you were at Google, and, you know, how did you come up with this idea for what would become Allegro?

And so you're kind of, you've been doing that at Google for a while, moving through your position. So what made you think, I have this idea? I'm going to make a major change in my life. You know, what gave you the motivation to go off and do a startup, find partners?

Can you give us a little bit of that backstory? First of all, I can't really take credit for the original idea behind Allegro AI. That's actually one of my partners. You know, I guess I can take credit in what we formed out of it and what it became, because obviously, as any company, especially startups, you know, we change and we adapt quickly to find product market fit.

So obviously, the vision, as it was set or thought of by my partner, needed to improve and get better, and, you know, that's something that I was involved in. But the original idea was not mine. It was more of, for my position, you know, I felt like I had the, you know, I'm in Tel Aviv, Israel, and it was about relocating my family back to the U.S. And at the time, this was about four years ago, it didn't make sense.

And coupled with the fact that, you know, I joined Google when it was 3,000 people. I think it's about 100,000 now or so, you know, on that kind of scale. It's a different company in many ways. And I felt like it was an amazing experience, and I learned so much, you know, especially, you know, being at Google in that time of growth.

But, you know, when I left, Google was a big company, with all the things that we all like less about big companies, and I felt like, you know, this was an opportunity to do something different and really go out and try to build something on my own. As I mentioned, I looked for something really big that would change the world. And, you know, basically, as a potential founder, I started, you know, quote, unquote, dating people, right, to find partners that we could, you know, come up with something that we'd like to do. And through that, quote, unquote, dating process, I met my current partners, and it was, you know, we hit it off, as you like to say, really quickly.

They're amazing guys. When you're in a startup and you have partners, you're practically in a Catholic marriage for the time until the exit. And so you want to make sure that you have people that you can trust and that they're, you know, great people that you can work with and obviously amazingly capable and talented. And I found all of that with them.

And basically, you know, as I mentioned, one of my partners was the one that, you know, was bringing that idea. And it came to him. He's a longtime serial entrepreneur. He's a very interesting profile where he has both a very, very strong engineering background as well as a data science background.

So the most prestigious lab today in AI in Israel is run by Professor Wolf, who's actually now at Facebook. And my co-founder, his name is Moses Guttmann, was his first PhD student, which basically means that they set up their lab together. And so he's really one of the pioneers of deep learning, machine learning, computer vision in Israel.

And he basically saw what Allegro really is all about, which is the fact that we need to bring engineering methodologies into the AI process. That was not the way that he said it at the time. But basically, that's the idea, right? How do we actually scale things up?

And I'm kind of curious on that front. And you stated it well. And I think that this has been brought up on our show multiple times from different perspectives. So I definitely think it is a theme that's kind of surging through the community that we need to be more rigorous in terms of the engineering we put into our workflows and the AI-driven products that we're building and putting out and the tools and all that sort of thing.

I was wondering, from your perspective, what you see as the challenges to kind of what are the sort of main challenges to getting people on board that are currently in data science and AI positions and kind of convincing them that they need to start doing things differently? What are some of those challenges? Does it have to do with this kind of variety of backgrounds that people come from? That it's not just engineers?

Or is it more than that? Yeah, that's a great question. And the answer, actually, is a moving target. Because our industry, right?

The one that you guys are talking about and the one that I'm squarely in is rapidly evolving and changing, as we speak, at an amazing rate. I've never experienced that kind of rate before in my career. So I generally say this, right? I mean, basically, right, it's a very different paradigm, right?

It's a scientific paradigm, right? And initially, people thought, well, you know, I'll get data scientists or research scientists. And that's what I need, right? And then we'll be able to do the job.

Obviously, we know that's not enough. The thing is that there's still, you know, a core and critical part of a team that needs to build something. But data scientists, research scientists have a very different, you know, mindset and outlook, right? I mean, they've been trained differently, right?

I mean, at the end of the day, they're scientists, right? And if you actually, you know, take it to the extreme, think of that, you know, mad scientist, nothing is in order. You know, everything is hectic. It's all about the creativity and finding the solution.

And there's a lot of truth in that. Obviously, that's an extreme exemplification. But there's a lot of that. And that's changing.

But we found, you know, throughout the course of the last three years, you know, data scientists, research scientists have been very much against adopting any tools because, you know, they came out of university. They were focused on, you know, on the science. Tooling, they didn't understand the value of tooling, the value of processes. In some ways, you might say, you know, maybe even, you know, they were a little bit wary of tools.

It's not going to be good for them or bad for them. It wasn't even something that they were exposed to during the curriculum or their training. On the flip side, you know, they felt like, you know, I'm a, for example, PhD out of whatever Stanford. I mean, I should know everything.

A lot of times we saw relatively very junior data scientists leading AI teams, not just in small companies, in very large companies, right? Because if you're not a Google or a Microsoft or a Facebook, you're not going to get the cream of the crop. And the last thing is, you know, their bosses didn't know what the heck they were doing. They didn't even know how to actually measure what they were doing.

And as I mentioned, they thought that, you know, bringing those people in would be enough. And so a lot of that created the situation where, you know, the background of, you know, why do I need tools? And a lot of that still exists now, I think. But, you know, a lot of people who have engineering background are actually sort of doing data science or, you know, engineering and data engineering because, you know, because it's new, it's interesting, salaries might be higher, et cetera.

Companies have realized that they're not seeing productivity out of the data science teams. And so that shift has been happening in the last, I guess, year, year and a half. We've seen companies integrate, right, their data scientists and research scientists into a larger product team, right, that has the engineers and the product leadership, et cetera, devops, to really, you know, push them to ultimately build a product. Because it's not about coming up with a research paper, right?

Ultimately, if you're in a company, most of the time it's about building a product or a service. And so I think now what we've seen is oftentimes a situation where there's a very big underappreciation of what it takes to build a state-of-the-art tool chain to support you. And I remember talking to someone who, way back in the day, was pushing, you know, SQL databases. Imagine that, right?

I mean, this is prehistoric times. And he was telling me how he had trouble pushing that into organizations because they thought they were going to build it themselves. Obviously, you know, anyone trying to do that fell flat on their faces. Same thing here.

And we've had situations, you know, that's changing a lot. But we've had situations where, you know, a couple of years ago, companies with us were talking about, like, I can build this in three weeks. I mean, they may believe that, right, after we showed them what we built. You know, today, a lot of these companies, because tools didn't necessarily exist or they weren't aware of them or they thought they could build it, you know, have invested internally and built something.

And you know how it is, right? Not invented here. And once you build, like, a small tool, you're enamored with it, especially, you know, but for this sake of us, engineers, right? I remember the energy.

I was enamored with some of the things that I built. And so that's kind of the hurdle that we as an industry that, you know, is building and pushing tools need to try to get around the world. So, Nir, I guess as we were starting to get into tools and you were talking about, you know, whether organizations were starting to recognize the need for tools and how did they get productive and measure that productivity. And that, you know, in a world that, you know, already has things like DevOps and ML engineering and data engineering and such, we're kind of moving into that area.

I noticed that, you know, kind of front and center on your website, you have this concept of MLOps. And as you were mentioning DevOps in passing and the tooling before, it really kind of triggered that. I'm wondering if you can kind of tell us what MLOps means to you and the organization, and kind of how does that differentiate itself from DevOps on the software side and other types of ML engineering and data engineering?

Absolutely. So actually, that's a great question because it actually touches on one point where, you know, MLOps itself as a term is not something that is set already and different companies are using it to mean slightly different things. You know, that's one of the, I guess, issues, again, with our industry. So early on, that terminology is not set.

When we talk about MLOps, you know, we talk about the ability to move from the data scientist's own machine or laptop, right, to training models at scale on some, like, cluster, like remote machine cluster. We're talking about the ability to orchestrate that and do that within, like, a larger team, not just a single data scientist, and we're talking about the ability to automate that process. That, I guess, in general, is what we talk about when we talk about MLOps. How is it different than DevOps?

Well, it's actually very different. So I guess, you know, let's define DevOps at a very high level, right? I mean, basically, the idea behind DevOps is that you want to make sure that a piece of software that usually is already tested, QA'd, and stable, right, that has left development and is now going into production to serve users or, you know, workloads, needs to work at scale and to stay up all the time, right? And you need to make sure that it can scale on clusters of machines, et cetera.

At the end of the day is what DevOps has to do. And so, basically, what do we say here, right? We said that there was a single piece of software that it was tested and it works and that you need to take that and you need to scale that up and replicate that, right? And that happens only in production.

Well, in AI, everything around that is actually different. So, first of all, as you guys know, right, machine learning, deep learning experiments can be very, very heavy workloads. I mean, you actually mentioned that yourself when you talked about building your own computer at the beginning, right? You're going to run things that are going to take hours, right?

Or even days. And so, unlike regular software, you need to be able to run stuff on large machines from day one, right? You're doing lots of experimentation until you reach your goal. And so, with experiments, you're basically running pieces of code that are slightly different from each other.

And that's a different thing than running the same piece of code on lots of machines. And so, basically, this is a very different problem. How do I, as a data science team, manage my workloads on clusters of machines? How do I handle lots of experiments that I need to run from one or more data scientists or a team of data scientists and do that effectively when we're talking about pieces of code that continually change?

How do I actually take the environment that I built? Because in AI, again, the piece of code that you're running actually is much more complex on one dimension than regular software, because really it's an amalgamation of the code with the data, right? How do I actually take that environment that was built by, you know, researcher X, and the model that she built, and then run it on a remote machine that has a different environment? And so, all these are different challenges.
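The environment mismatch described here can be made concrete with a small sketch. The Python below is not Trains code and does not use its API. `snapshot_environment` and `diff_packages` are hypothetical helper names, built only on the standard library, that record the interpreter and installed package versions on one machine and report where a second machine differs.

```python
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Record the interpreter version, OS, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with unreadable metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def diff_packages(local: dict, remote: dict) -> dict:
    """Return {package: (local_version, remote_version)} for every mismatch."""
    names = set(local) | set(remote)
    return {
        name: (local.get(name), remote.get(name))
        for name in sorted(names)
        if local.get(name) != remote.get(name)
    }


# Compare this machine against a made-up "remote" that has one extra package.
local = snapshot_environment()
remote_packages = {**local["packages"], "example-package": "0.0.1"}  # hypothetical
print(diff_packages(local["packages"], remote_packages))
```

Storing a snapshot like this next to each experiment is one way a remote worker could check, before a long run starts, that it matches the machine the experiment was developed on.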

And these are the challenges that we attempt to solve. And this is what we call MLOps. Yeah, that's a really good summary. I like how you set that up in terms of the comparison to DevOps because it is kind of maybe a shock for people starting to get into this field where, like you say, from day one, in order to actually make progress on their things, they might have to know about, oh, spinning up this GPU instance in the cloud or, you know, CUDA libraries and running things in a repeatable way.

It seems like a really high barrier for people to overcome, you know, from day one to get things working and also do it in a repeatable way. Yeah, I also wonder on that, like you're talking about experiments and that sort of thing. I know one thing that is definitely true of myself and, you know, my wife could confirm is that I'm not very good at remembering what I've done or what needs to be done, right? In terms of the experiment tracking side of things, of course, there's like the running of things, which is definitely important.

I think that's what maybe you focus on mostly, but there's also kind of this weird documentation almost piece of the puzzle. It's not quite documentation because it's like a very specific type of documentation that's really documenting like what have I done and what haven't I done and how successful was that? And it's not really like you want to have a research paper necessarily, especially if you're developing these things as a product or maybe even a trade secret, especially if you're on a team and want to have that common understanding of what has been done and hasn't been done. How soon do you see teams encountering that issue when they start working on this problem and what are those kind of essential elements of, I guess, more of the documentation or tracking side of things that need to be in stone somewhere over time?

Yeah, well, you know, as you were saying about documenting and how it's not exactly documented, if you come up with a term, please let me know. Naming things is the hardest thing. Exactly. Doc Ops.

Yeah. Doc Ops. That's probably already taken. That has to be taken.

That's a real issue. You know, that's a real start to name it. So I'll pretend to say actually, you're saying, you know, we're focusing more on the MLOps. Actually, you know, one area where we're pretty unique is that we have a very highly integrated solution where we think that you can't focus on just one thing. If you don't have a highly integrated platform that actually takes care of the experiment management, the data management, the versioning, and the MLOps, you don't have the best scalable solution, but we can talk about that later if you'd like.

The experiment management part of it, the documentation part of it, when do people realize that they need it? The answer is actually when someone in the team that usually has some sort of engineering background says, stop, this is crazy. You know? Yeah.

That's exactly the point. I remember Doug, if you're out there, his name was Doug. He's a great engineer, one of the startups I work with. He was my wake-up call.

And so, you know, I mean, it could happen with a team of one and it can happen, we've seen it happen, well, we've actually seen teams of tens of data scientists that didn't have that, right? And it really depends if you have that person who realizes that and has the influence and or power to actually say, you know, we need to change this. Yeah, yeah. And I guess this is something that we've kind of talked about in passing, but that's this interaction between AI developers or data scientists and the rest of an engineering organization.

So maybe a follow-up question to Chris's question about differentiating MLOps and DevOps. What is the kind of integration point from your perspective between the two worlds? Because if things eventually end up in a product, right? Like I'm importing a model into some API handler in some code that is production product code, there has to be an integration point somewhere.

Where does that exist and what challenges are at that integration point? That's a great question. Actually, the integration is something that happens continuously if you're actually running things well. So it's exactly the same, right?

I mean, ultimately what you want to do is you want to take this model that you built to predict something or to solve something and then integrate that into, call it a wrapper, or some larger piece of code that actually carries out the ultimate task of that product. The thing is, oftentimes, you could test your model a lot in an environment that's kind of clean, but ultimately you're going to want to test it in the field and you're going to have to have that wrapper. The other point also is that once you want to get into automation, right, and even if you're still within the data science part, if you want to get into automation and create lots and lots of experiments and you want to maybe actually feed in continuously new data that's coming in, let's say you're building an autonomous vehicle and you're getting constantly new videos from your cars driving around and you want to actually improve your models based on that, then that also creates an integration point. So the integration point is on those two levels.

One is when you have to hand over the code so that it gets wrapped and two, when you actually want to integrate those experiments within a larger pipeline that helps improve them. And there's another point that we actually try to facilitate with our product, which is how can I lower the barrier to entry? And I'll explain. Let's say you're a company and you're building a solution to, I don't know, let's say computer vision is easy.

Let's say you're building something to identify cats, let's take the ultimate example. But you also need to identify dogs because you're building a pet detector, whatever. You're speaking right to Chris's heart. I was keeping my mouth shut this time.

Yes, Daniel can't normally shut me up on that. So go for it. Let's hear it. As a scientist, you understand that if you've built a model and I'm talking about the code right now that facilitates object detection for cats.

Well, if you now want to do the same thing for dogs, what you need to do is you need to take that code that you built for the experiment and probably the same neural network that is the one that you chose for identifying objects in whatever scenario and now marry it with a different data set. That's it. Why would you necessarily need a data scientist for that? Why couldn't an engineer do that?

And that's behind a lot of the stuff that we're doing also. The ability to actually have the data scientists work on the core pet detector model and then have engineers facilitate optimizing that for the different objects. Yeah, and I think that also actually that example itself illustrates another kind of unique feature of this. I think you're right in that those later stages could be kind of, the popular word I guess is democratized, to other people within the organization, right?

But also it's still not quite the same as, like, a normal DevOps in that, like, if you're running with a different data set, somehow you need to have a kind of unique tracking that's going on with, like, what data set was used to train this particular artifact or serialized model, you know, at what time, because the code might actually be exactly the same, right? The difference might be in the data. Exactly, exactly. Yeah, so I see so many people, like, develop really sophisticated kind of naming for their files and such, which you probably need your own documentation to document that.
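As a rough illustration of that kind of tracking, here is a minimal Python sketch, assuming a data set is just a directory of files. It is not how Trains does it. `dataset_fingerprint` and `experiment_record` are hypothetical names, and the commit id "abc123" is a placeholder. The point it shows is that two runs can share the same code version and parameters and still be told apart by a hash of the data they were trained on.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """Hash every file under root (relative path plus bytes) into one digest."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def experiment_record(name: str, code_version: str, data_root: Path, params: dict) -> dict:
    """Bundle the facts needed to tell two training runs apart."""
    return {
        "name": name,
        "code_version": code_version,  # e.g. a git commit id (placeholder below)
        "dataset_fingerprint": dataset_fingerprint(data_root),
        "params": params,
    }


# Two runs with identical code and parameters but different data.
with tempfile.TemporaryDirectory() as tmp:
    cats, dogs = Path(tmp, "cats"), Path(tmp, "dogs")
    cats.mkdir()
    dogs.mkdir()
    (cats / "labels.csv").write_text("img1.jpg,cat\n")
    (dogs / "labels.csv").write_text("img1.jpg,dog\n")

    run_cats = experiment_record("detector-cats", "abc123", cats, {"lr": 0.001})
    run_dogs = experiment_record("detector-dogs", "abc123", dogs, {"lr": 0.001})

print(json.dumps(run_cats, indent=2))
print(run_cats["dataset_fingerprint"] != run_dogs["dataset_fingerprint"])
```

A record like this, saved alongside the serialized model, replaces the elaborate file-naming schemes mentioned above with something that can be compared and searched.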

What about the data side? We mostly kind of talked about process and the operations infrastructure. What about the data side of things? So data is the holy grail.

At the end of the day, and I think that, you know, obviously experienced and senior data scientists get this, right? It's all about the data. A lot of us are focused more on models, but at the end of the day, you know, the difference between a product that meets the threshold of, you know, whatever KPIs you want it to hit and something that doesn't is about your ability to, you know, train it on the right data set, right? And be on top of your data and be able to feed the exact, what we call, data view to train that model.

And so iterating on the data, right? Identifying the skews within the data and handling those, identifying the holes where you need to add more data or, you know, build synthetic data, like there are augmentations around that. That is the key piece. And as, you know, we talked about, it's an experiment process, and so being able to actually version that and track that, because as an experiment process, you know, you're going down one track and then you realize, you know what, actually you want to go back to the model I built two months ago and actually take a different direction. You have to be able to version that, and not just the model. You want to be able to version the data set. And if you have enough experience as a data scientist, you know that you're always going to find data sets that work better for whatever reason, and you don't even know why, right? You don't know why, but whatever it is, I mean, you know, there are so many examples of data sets that are, quote unquote, wrong because, you know, the metadata on them isn't necessarily correct, but somehow they produce better results, and, you know, the data set that's better, right?

And so you have to version your data. You have to version the data, not just the metadata around that, so that you can effectively go through that process and make sure

If you like this show and you aren't listening to The Changelog, hey, let's fix that bug. The Changelog is our flagship show, and we've been doing it for over a decade. Adam and I seek out and interview the people who are pushing the world forward with software. We dive deep into the hacks, the innovations, and the leadership required to do what these amazing people do. One recent example is our conversation with Anders Damsgaard, a climate scientist from Denmark, who gave us a peek inside his work and how he scratched a common itch he has when gathering academic research from around the web. Here's a dorky moment from that episode.

Are you trying to be right, or are you trying to solve the world's problems? Exactly. If your scientist is trying to be right, well, then your right may not actually be the right. Yeah, exactly. There's another saying: all models are wrong, but some are useful. I like that one. There's another saying: all models are wrong except for mine.

We had a lot of fun with Anders. He's a fascinating guy. Continue listening at changelog.com/podcast/378, or search for The Changelog on your favorite podcast app and find the episode called Open Source Meets Climate Science.

So before we got to the break, you were talking about versioning the data, and I wanted to let you finish that thought. And then I actually wanted to also explore how Allegro is moving MLOps in a practical way, like what you're actually focusing on and how you're implementing MLOps. But if you'd finish your thought on data versioning, we'd love to hear it.

Sure. With respect to data versioning, at the end of the day, as I mentioned, we think it's about being able to have a set of tools that enables you to effectively manage your data sets and their versions, and effectively also be able to obfuscate the connection between the code and the data, so that we can facilitate, for example, the ability to move from a cat detector to an op detector, because now you're using a different data set. And again, as data scientists, you all know that taking one data set with code and actually switching it to different data sets is not as trivial as one would like it to be. And so those are some of the goals that we set out to achieve with Allegro: really, making the ability to switch between the data sets and the code and the models as easy as plug and play.

Yeah. So what is the range of things that Allegro focuses on in its actual offering? I know that there's the Trains project, which was mentioned in that tweet that got me interested, that joined the PyTorch ecosystem. So how does that fit into the wider scheme of what Allegro is offering, and how does a data scientist interact with it, I guess?

Sure. What we provide is a platform, or tool chain, or set of tools that basically takes care of the experiment process; the MLOps part of it, the ability to actually scale and run things effectively; and the data. The full platform, which isn't completely available as open source, basically has all these key pieces together, highly, highly integrated. What we've open sourced, or what's available as an open source project, is the experiment management part of things, which is all about documentation. We talked about the ability to document things, version your models, your experiments, your parameters, everything around that, reproduce, compare, etc. And everything that has to do with the, I guess, basic MLOps: the ability for a team of data scientists to actually manage a cluster, whether it's on-prem or on cloud or a combination, and really self-serve with orchestration and scheduling, etc., and automation on top of that. And then some basic, actually it's not basic because it's on par with whatever else is out there, but data management, or what we call data tracking at least, is available in the open source, whereas the enterprise version adds on top of that much more sophisticated data management, more sophisticated scaling, and a data pipeline on top of the platform, so that companies can actually build a specific pipeline for what they need, and obviously the standard enterprise-relevant features like user management, permissions, managed services, all that stuff.

So I'm curious, as you're describing this, and I appreciate you talking a little bit toward what was open source versus what was the enterprise. There is a variety of ways they may implement how they are allocating resources for their own MLOps prior to you coming into the picture with them. You know, some people are strictly cloud-based. They may be doing, you know, Google or AWS or Azure.

Some people, or organizations to be more specific, may be buying a bunch of DGXs from NVIDIA and have a cluster set up locally, or some hybrid form. Which of these scenarios does Allegro fit into? And if multiple, how does it change how you would implement Allegro? Actually, we fit into every one of those scenarios.

Any hybrid scenario that you can think of. And actually, the more complex your environment is, the more Allegro Trains shines. And I'll explain. Basically, the way that Allegro Trains is set up is you have a server backend that manages the processes, records and logs everything, and then sets up the instructions for the clients, which are basically what the data scientists connect with, as well as the agents that run on the machines that do the actual training.

The system is built that you can set it up on any type of machine for training. It could be, you know, DGX. It could be any type of GPU by NVIDIA. It could actually be CPU.

It doesn't really matter. It can sit on the cloud, on-prem, any combination, on any cloud that you like. And it all works. In fact, a significant portion of our customers have a hybrid solution where they have on-prem systems and then they actually burst into the cloud, right?

When they have specific times where they actually need more processing power. And that becomes really effective for them. We have other customers that are completely on the cloud, and everything in between. Why Allegro Trains actually shines more the more complex your environment is: on the first level, the interface to manage these clusters is really, really simple.

You can actually try it out. We have a demo server up on the web. The data scientists actually manage queues where they can set up, you know, the machines. I want one, you know, GPU or I need a cluster of eight GPUs or whatever.
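The queue idea can be pictured with a toy model, assuming agents on training machines pull work from the queues they serve. The `Queue` and `Agent` classes and the queue names below are hypothetical illustrations and are not the Trains API.

```python
from collections import deque


class Queue:
    """A named list of pending experiments, e.g. one per resource type."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()


class Agent:
    """Runs on a training machine and serves one or more queues."""

    def __init__(self, machine, queues):
        self.machine = machine
        self.queues = queues

    def pull(self):
        # Queues listed earlier take priority over later ones.
        for queue in self.queues:
            if queue.tasks:
                return queue.name, queue.tasks.popleft()
        return None


single_gpu = Queue("single_gpu")
eight_gpu = Queue("eight_gpu")
single_gpu.tasks.append("experiment-1")
eight_gpu.tasks.append("experiment-2")

on_prem = Agent("on-prem-box", [single_gpu])
cloud = Agent("cloud-cluster", [eight_gpu, single_gpu])

first = cloud.pull()     # ('eight_gpu', 'experiment-2')
second = on_prem.pull()  # ('single_gpu', 'experiment-1')
third = cloud.pull()     # None, both queues are now empty
print(first, second, third)
```

The person submitting `experiment-1` only names a queue; which machine ends up running it is decided by whichever agent serves that queue.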

And it's completely invisible to them where those machines sit. With the enterprise version, we go even further and we provide three layers of software caching and what we call zero data move. So if you have a complex system where you have data in multiple locations, we'll make sure the data goes to the right machine to train, one that's close by to it. We'll make sure that there's local caching so that it doesn't have to go back again and again.

And so the data moves as little as it can. And we can actually, we go even further, you can actually do federated learning on our platform. And so you can actually have data being trained in multiple locations geographically around the world and then combined into a single model. Really interesting.
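For readers unfamiliar with federated learning, one common way to combine models trained in separate locations is federated averaging: average each parameter across sites, weighted by how many samples each site trained on. The sketch below shows that generic technique with made-up numbers; the conversation does not say which method the Allegro platform uses.

```python
def federated_average(site_weights, site_sizes):
    """Average per-site model weights, weighted by local sample counts."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


# Two sites with two-parameter models; the second site has 3x the data.
weights = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
global_weights = federated_average(weights, sizes)
print(global_weights)  # [2.5, 3.5]
```

Only the weights and sample counts leave each site here; the raw data stays where it is.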

And I think you're kind of getting at, or hinted at, some of these things, but just for my own understanding, it sounds like there's the Allegro Trains server, which aggregates all this information for that experiment management; I guess the central brain is maybe a way to think of it. In my understanding, let's say I just have a machine, my own machine, and I have some code on it, and I want that to be tracked by the Allegro Trains server. I think, based on what I was reading, you just decorate that code with a certain snippet that connects to the centralized Trains server. Is that kind of the workflow for that scenario?

Exactly. Okay. We try to make it as simple as possible. We've made it all magical.

There's a snippet of code. It's basically two lines of code. You put it just once in your code and that's it. You're done.
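The two lines in question, as given in the Trains documentation at the time, are an import of `Task` and a call to `Task.init`. The sketch below wraps them in a helper so the script still runs when the `trains` package is not installed. The helper name and the project and task names are placeholders, and the wrapper itself is an assumption rather than something from the episode.

```python
# The documented pattern is two lines near the top of a training script:
#     from trains import Task
#     task = Task.init(project_name="examples", task_name="my experiment")


def init_tracking(project_name, task_name):
    """Start Trains tracking if the package is installed; else return None."""
    try:
        from trains import Task  # requires `pip install trains`
    except ImportError:
        return None
    # Registers this run with the configured Trains server.
    return Task.init(project_name=project_name, task_name=task_name)
```

Calling `init_tracking("examples", "my experiment")` once at the start of a script would be the equivalent of pasting the two lines directly.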

Everything is tracked for you. And you know that, and we kind of have regular calls to just talk about AI things because, you know, we both work in companies where there's not that many AI type people. And so we like to share things that we're learning and all that stuff. His name's Will.

Shout out to Will out there if you're listening. But I asked him, one of the first times we were talking about his workflow and all those things. We got into this topic of MLOps and all that. He's like, oh, I use this Allegro Trains thing.

It's amazing. And I was just talking to him earlier today, actually. And I was like, hey, I'm going to talk to the Allegro Trains, the Allegro AI people later today. What do you want me to say?

And one of the things he said was it's just for him, it's super easy. Like you were saying, pretty low barrier to, you know, add the snippet to your code and kind of things happen automatically like you were talking about. And the other thing he definitely wanted to mention was that the team is super responsive and he mentioned raising various things on GitHub and all that. And the team is very responsive.

So great job. You've got a very happy user in Will here in Indiana. Well, thank you, Will. Awesome.

But yeah, he was kind of telling me about how some of that works. And then there's also, you mentioned the agents. The agents, do those have to do with the more automated runs that happen across a set of shared resources, or where does that fit in? So the agents are, if you basically want to run your code on a machine, you basically set up an agent on that machine, whether it's a DGX or a GPU or whatever you have.

And that agent is tasked with basically, that agent is then associated with some queues that you create. It could be associated with one or more queues. So it's a little piece of code that sits on any machine that is potentially a target for running your experiments on. So one of the things I'm curious about, and I meant to ask you this a little bit a while ago when you were touching on it, was some of the motivation that you had for going with an open source business model that builds an enterprise business on top of that.

And did you always know that that was going to be the approach you guys were going to take, or did you consider any others? And how has that model worked out for you? That's a very revealing question for us. When we started out, we probably erred on where the market, you know... So I guess one of the things that you do in a startup is you're trying to time the market. I think I saw several articles talking about timing being the number one critical aspect of a startup's success, and actually one of the hardest to hit, right?

And sometimes even VCs call it luck. But we were trying to time the market, because what we had built initially was around the holy grail, the data. And we basically built a system with the thought in mind of, well, companies are now doing development, but they're going to get to scale, and they're going to have to manage huge data sets that constantly change and that you have to version, and you have lots of experiments on these things running on multiple clusters. How do you handle all of that?

And so we actually set out to build this really big, robust system. And then we found out that very few companies were at the stage where they needed this or realized its value. And so we went back and started thinking, where is the industry now, and how can we help the industry progress? And we figured that the right thing to do is to meet the industry where it is, which was before that scale, and come up and say, all right, so what is the low-hanging fruit, the things that can bring immediate value to data scientists out there?

And the first thing was the experiment management, and immediately after that, the MLOps, or at least the MLOps in its lighter form, right? Don't think of a huge conglomerate running hundreds and thousands of experiments, but even small teams. And we thought that was the best way to do that: to really contribute to the community, help spur that along, and make that something where a lot of people can do stuff better, in the way we think would be better and helpful. And ultimately, obviously, we're a company, we're about making money, I think users and also attention and kind of joining in with the PyTorch ecosystem, like in the blog posts and other things. I think that really allows people to solve a pain point that they really have really, really quickly, and hopefully it does eventually spur them on, especially if they're part of larger companies or teams, to integrate more with your enterprise systems.

But it's been amazing to talk today. The topic's very close to what I'm super passionate about, I think, Chris, as well. And part of the reason why we do this podcast is to talk about those practicalities of how people do their AI development. So really appreciate you joining.

We'll link the demo server and the links to Allegro Trains on GitHub and also your main website which talks about all of your offerings. We'll put that in the show notes for sure and I encourage people to go there and check out those things and let us know in Slack or on LinkedIn or other places what you think and how you like what they're doing. But really appreciate you joining here. It's been a great conversation.

Thank you so much. It was a pleasure. It was fun. And really, thank you so much for having me.

Have you joined the free Changelog community yet? I'm not sure what you're waiting for. You get Changelog news, email notifications of new podcast episodes, and access to our community Slack and Practical AI channel, where fun and interesting AI discussions take place all the time, all for the price of a free hot dog. Check us out at changelog.com/community.

We'd love to have you. Practical AI is hosted by Daniel Whitenack and Chris Benson. It's produced by me, Jerod Santo, and our music is provided by the mysterious Breakmaster Cylinder. We're brought to you by some amazing companies who get it. Thanks to Fastly, Linode, and Rollbar.

That's all for now. We'll talk to you again next week.

Frequently Asked Questions

How long is this episode of Changelog Master Feed?

This episode is 51 minutes long.

When was this Changelog Master Feed episode published?

This episode was published on July 20, 2020.
