Pachyderm's Kubernetes-based infrastructure for AI

What this episode covers

Joe Doliner (JD) joined the show to talk about productionizing ML/AI with Pachyderm, an open source data science platform built on Kubernetes (k8s). We talked through the origins of Pachyderm, challenges associated with creating infrastructure for machine learning, and data and model versioning/provenance. He also walked us through a process for going from a Jupyter notebook to a production data pipeline.

Sponsors: DigitalOcean, Fastly, Rollbar, Linode

Featuring: Joe Doliner, Chris Benson, Daniel Whitenack

Show Notes:
Pachyderm
Pachyderm on GitHub
Pachyderm tutorials
DoD challenge built using Pachyderm


TRANSCRIPT · AUTO-GENERATED

Bandwidth for Changelog is provided by Fastly. Learn more at Fastly.com. We move fast and fix things here at Changelog because of Rollbar. Check them out at Rollbar.com.

And we're hosted on Linode servers. Head to linode.com/changelog. This episode is brought to you by DigitalOcean. They now have CPU-optimized droplets with dedicated hyperthreads from best-in-class Intel CPUs for all your machine learning and batch processing needs.

You can easily spin up their one-click machine learning and AI application image. This gives you immediate access to Python 3, R, Jupyter Notebook, TensorFlow, Scikit, and PyTorch. Use our special link to get $100 credit for DigitalOcean and try it today for free. Head to do.co/changelog.

Once again, do.co/changelog.

Welcome to Practical AI, a weekly podcast about making artificial intelligence practical, productive, and accessible to everyone. This is where conversations around AI, machine learning, and data science happen. Join the community and Slack with us around various topics of the show at changelog.com/community.

Follow us on Twitter; we're @PracticalAIFM. And now on to the show.

Welcome to Practical AI. Hey, Chris.

How's it going, man? Pretty good. How you doing, Daniel? Doing really good.

I'm really happy today with the conversation that we're going to have because we're going to be talking to my old colleague and still great friend, Joe Doliner, or as I call him, JD. Welcome, Joe. Hey, Dan. It's great to be here.

Hey, Chris. It's great to meet you on your show. Great to meet you, too. Yeah.

Thank you so much for joining us. Thank you for having me. Yeah. Why don't you give us a little bit of background about what you're currently involved with and how you got there?

Yeah, absolutely. So as you said, I'm Joe Doliner. Everyone calls me JD. I am the CEO and founder of Pachyderm, which is a company that builds data science tools that we'll be talking about today.

Before that, I've worked at a number of startups. Probably the most relevant one to this conversation is that I also worked at Airbnb as a data infrastructure engineer, basically just managing their AI and data infrastructure for the company. And so I have a lot of experience on the infrastructure side of data science, less so as an actual practitioner. And so that's most of what we're going to be talking about today.

Awesome. Yeah, that's a perfect setup. I think that we've done a lot of talking about AI, but we really haven't got into a ton of infrastructure stuff yet, I don't think. Have we, Chris?

Not really. And I think this is an episode long overdue. And just a note to the listeners: I know you had said that you had previously worked with JD at Pachyderm. I have not.

I'm familiar with Pachyderm as a newbie, so it'll be an interesting conversation for me having a couple of experts on here. And I'm going to ask all the stupid questions, OK? Well, you know he's not my inside man. Dan might be, but Chris definitely isn't.

Yeah. Yeah. Full disclosure, I might be a little bit biased. I don't officially work for Pachyderm anymore, although I'm a huge fan. I'm actually using Pachyderm on my current project, so I have that bias, but I'm excited to dive into the details and have you learn a little bit more, too, Chris.

Yeah, absolutely. Over the time that we've known each other, since we first met and you've been talking about it, I've adopted it. I have a long way to go to catch up to where you guys are in terms of using it as well, but as a beginner, it's definitely something I'm interested in, so I can't wait to hear more from JD. Definitely, yeah.

So with that, JD, why don't you give us just kind of a high-level overview of what Pachyderm is and kind of the needs that it's fulfilling or what it's trying to do for data scientists and people working in machine learning and AI? Yeah, absolutely. So Pachyderm is basically designed to be everything that you need to do high-level production data infrastructure in a box. And so what that means is, if you're used to doing AI workloads in Jupyter notebooks on your laptop or maybe just in Python directly, using something like TensorFlow, Pachyderm is not in any way saying that you should stop doing that.

Pachyderm is just giving you a way to take that code and deploy it on the cloud in a distributed fashion so that you know it's going to run every single night, or hook it up with other processing steps so that you can have everything sort of going in a pipeline end-to-end. And this is what companies turn to when they need to make that leap from a model that's on somebody's laptop to something that's a core part of their business and is going to run every single night. This sort of all came out of my experiences at Airbnb, where I was basically trying to make a platform that did that for our data scientists. And while I was working there, I had a couple of sort of novel ideas for what I thought that the world of data infrastructure was missing and what I wanted to bring to it.

So the first really unique thing that we did with Pachyderm is, you know, we needed a way to store data. So we have a distributed file system. It's called the Pachyderm File System. If you're familiar with the Hadoop ecosystem, this is probably something pretty similar to HDFS or Tachyon or something like that.

What's different about our file system is that it's capable of version controlling large data sets in addition to storing them. And so you can have, you know, your training data set, it can be terabytes of data, and this data is constantly coming in from your users on a website, from satellite imagery or something like that. And the Pachyderm File System will actually give you discrete commits like in Git, where you can see, okay, this is what my training data set looked like a week ago. This is what it looked like a month ago and things like that.

And what's really important for AI is that not only do we keep these different versions, but we actually link them to their outputs using a system that we call provenance. And so at any time when you train a model in Pachyderm, you can ask the system, what is the provenance for this model? And it'll trace you back to all the different pieces of training data that went into it and all the different pieces of code that went into training this model so that you can basically see where it came from and you can reproduce your results. Does that make sense to you guys?
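To make the idea of commits plus provenance concrete, here is a minimal toy sketch in Python. It is not Pachyderm's implementation or API; the names Commit, make_commit, and trace are hypothetical and exist only for illustration. Each commit is an immutable snapshot of a repo, and an output commit records the input commits it was derived from, so the lineage of a model can be walked back to its data and code.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical immutable snapshot of a repo plus its upstream commits."""
    repo: str
    commit_id: str
    files: tuple            # sorted (path, content_hash) pairs
    provenance: tuple = ()  # upstream Commit objects this one was derived from


def make_commit(repo, files, provenance=()):
    """Snapshot `files` (a dict of path -> bytes) as a new commit."""
    hashed = tuple(sorted((path, hashlib.sha256(data).hexdigest())
                          for path, data in files.items()))
    upstream_ids = [c.commit_id for c in provenance]
    digest = hashlib.sha256(repr((repo, hashed, upstream_ids)).encode())
    return Commit(repo, digest.hexdigest()[:12], hashed, tuple(provenance))


def trace(commit):
    """Return every upstream commit that contributed to `commit`."""
    seen = []
    stack = list(commit.provenance)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(current.provenance)
    return seen
```

For example, committing a data set and a training script, then committing a model with both as provenance, lets trace(model) return the exact data and code commits behind that model. Changing a single file produces a different commit ID, which is what makes "what did my training data look like a week ago" answerable.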

It does. I'm going to dive in since I'm the newbie on this. And I'm asking this on behalf of the listeners and partly for myself. First of all, a quick question.

Is it a proprietary system? Is it open source? This is all open source. We do have an enterprise system that goes on top of it.

And I'll talk to you later about what features are limited to the enterprise system. But nothing that I've talked about up until this point is limited to that. This is all open source, so you can download it yourself. Okay. And to kind of wrap our heads around it a little bit, you mentioned the file system and versioning, and there's this feature called provenance where you can go back and do that.

Could you kind of describe for someone who has never heard of Pachyderm what the feature set is and what kind of a typical use case might be so that in their own shop where they're doing data science, they can kind of figure out how it fits in with what they're already doing? Yeah, yeah, absolutely. So I think it's easiest to sort of focus in on a use case here. So one that I can talk about very publicly because it was a public competition was the Department of Defense was until recently running a competition where they were basically having people write image detection algorithms for satellite imagery that they had, right?

So they had a bunch of satellite images that they had taken and they wanted people to write models that would detect this is a hospital right here, this is a school, this is a bus, things like that. Interesting AI problem. Also an interesting architecture problem, right? Because they have people just basically throwing code at them through this web interface and they need to take that and run it through their pipeline and get results out the other end and give those to users.

So the way that they set that up in Pachyderm is first they set up an instance of it and deployed it on AWS, and as a backing store they used S3. So ultimately all of this was stored in object storage, which made it very, very easy for them to manage. And then they loaded all of the satellite images into the Pachyderm File System. And so, you know, you can get stuff in there in a number of ways.

You can get it in there directly from object storage, you can push it over HTTP. I'm not sure exactly which one they use. But from there, they now had a system where all the data was just sitting there in different versions. They could update it and have a new version.

And then anytime a user's code came in, they just deployed a new pipeline on Pachyderm. And that would then take all those images and process them in parallel. And out the other end after some processing would come just a score report that they could report back to the user. And that might include your code failed on these five images.

So you'll get a score or it might be your code succeeded on these five images. And here's how accurate you were. It would get a full report. So that's like, here's what you did well on, here's what you didn't do well on.

Does that answer your question? Or do you want to know more about sort of specific features within Pachyderm? No, that does help a little bit. I guess as a follow-up, you talk about the file system and versioning.

Are there any other kind of high-level, key things that you want to name, features that you really can't use Pachyderm without considering? So in terms of the file system, that really basically covers it. It does basically all the standard things that you expect from a distributed file system plus the versioning and provenance component. And that's really the only quirk to it.

Now, on the processing side, things also start to get interesting. And here is where we need to start introducing maybe a few jargony words that I will explain. So one of the sort of key things that we use in Pachyderm is containers. And I'm sure most listeners at this point have heard of the company Docker, which has been a very successful Silicon Valley company.

And they make this thing called a container, which is basically just a standard way to ship around code, right? Think of the problem that you've had where, you know, you write some script in Python that trains a model, then you send it over to your friend, and they've got the wrong version of Python, or they've got the wrong version of TensorFlow installed or something like that, and it's all incompatible. A Docker container is a way to ship code that's going to work anywhere, regardless of what the user has got installed on their machine or what's installed on the cluster. Pachyderm's processing is all built on Docker containers.

And so what that means is that you as a data scientist, when you want to productionize your code and take it off of your laptop and into the cluster, then all you need to do is package it up into a Docker container, which means that there's a little bit of a learning curve there to understand the tooling of Docker. But once you've got that, you as a data scientist are now completely in control of the environment that your code runs in and all of the dependencies and everything like that. And so once people grok this, it's actually very, very liberating. And the reason that I wanted to build this on top of containers was because when I was at Airbnb, we would have these problems all the time where a data scientist would come to me and they'd written some new piece of processing that they wanted to be in the company's pipeline.

Could be a machine learning model or could just be something as simple as data cleaning or something like that. And they would send me the Python script, and then I would realize, oh, this isn't quite compatible with what I've got on the cluster. And we didn't have Docker containers there. We just had one big monolithic cluster.

And so if we didn't have the right versions of Python installed, I actually would have to either redeploy the entire cluster just to run that one user's code, which was very untenable, or I would have to have them change their code to use different versions, things like that. So it was this constant back and forth where the data scientists couldn't quite use the tools they wanted. Our infrastructure people couldn't quite maintain a cluster with a consistent set of tools. And so I had this aha moment where I realized if these guys could just use Docker containers, then this impedance mismatch would totally go away, and we could both do our jobs a lot more easily.

Does that make sense, Chris? I was just going to say, following up on that, it's kind of like whether you're using Python or R or Java or whatever the different tool you're using is or the language you're using, essentially these containers unify the way that you treat each processing step. Would that be an accurate way to say it? Absolutely, yeah.

So it allows us to basically handle the infrastructure the same way, no matter what language the code is written in. And so we have a lot of companies where one of the things that's really appealing about Pachyderm is that all their data scientists just know different languages. And they're looking for some sane way to have everybody writing code in their own language and pull it all together into a system that they can understand. And Pachyderm allows them to do that.

Now, the key thing about this, of course, is that because we have the provenance tracking, you can still see the fact that, oh, this data flowed through all of these steps and came out the other end, even though one step was Python, one step was Ruby, one step was Java, one step was C++. And you didn't have to write any special tooling within those languages to track the data. Yeah, that's awesome. So I'm going to kind of pose a problem.

I want to see if you would kind of go about things the same way as I would, JD. So let's say that we have a, you know, we have a Jupyter notebook. And I like how you kind of brought that up before, because that's where a lot of data scientists kind of start out. So let's say that Chris and I have been working on this Jupyter notebook that has some preprocessing for images.

And then we train a particular model, let's say, in TensorFlow. And then we output results and then maybe do some post-processing. And to test it out, we just kind of downloaded like a sample data set of images locally. And then we've kind of proven that, yeah, this is like a good way that we think we should do this kind of in this Jupyter notebook.

So in order for us to kind of get that scenario off of our laptops and into Pachyderm, what would be the things that we would need? What would be the stuff that we should do, both on the data and the processing side? That is a great question. And I think it will be a really illustrative answer.

I'm going to try to answer this not by jumping straight to the end state, where you're using all the Pachyderm features. I'm going to build it up piece by piece, which is how we recommend data scientists do it. So the first kind of problem that you need to solve when you want to put a Jupyter notebook into Pachyderm is the fact that Jupyter notebooks are meant to be interactive, right? They're meant to have a user opening up the browser and actually clicking the run button and stuff like that.

And so the first thing that you can do is you can actually run Jupyter inside of a Pachyderm service and you can just run Jupyter notebooks all by themselves, but they can't just turn into a pipeline that runs without any human intervention, right? Because Jupyter is designed that way. Like an automated and triggered sort of way. Right, right.

So the first step to do is just to extract the code from Jupyter. I'm pretty sure Jupyter makes it very easy to export as a Python script at this point. And so you would do that and then you would put that in a Python container with whatever dependencies you need. And to start, I wouldn't even tease apart these different steps, the pre-processing, the model training, and the post-processing.

You could just do all of those in one container and you wouldn't even necessarily need to parallelize the data, because if it was running on your laptop, it could probably run on an EC2 node as well. And so that process, I think, would take you, you know, if you had Pachyderm set up to begin with, you could probably do that in 20 minutes. And then you would have gone from a system that you can run manually on your laptop and edit to a system that now runs every single time a new image comes into the repository or you change the code or something like that. And also, of course, now it's deployed on the cloud, so you can easily throw a GPU in there if you want.
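As a rough illustration of that first step, the sketch below shows what a notebook exported to a plain script might look like: one function that reads every file in an input directory, runs all the processing in one place, and writes results to an output directory, with no interactive step. The function names, the placeholder processing, and the directory arguments are all assumptions made for this example, not anything prescribed by Pachyderm or Jupyter.

```python
import sys
from pathlib import Path


def process(raw: bytes) -> bytes:
    """Placeholder for the notebook's preprocessing, training, and post-processing."""
    return raw.strip()


def run(input_dir: str, output_dir: str) -> int:
    """Process every file in input_dir and write a result file per input."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file():
            (out / path.name).write_bytes(process(path.read_bytes()))
            count += 1
    return count


# Hypothetical command-line entry point: python script.py <input_dir> <output_dir>
if __name__ == "__main__" and len(sys.argv) == 3:
    print(f"processed {run(sys.argv[1], sys.argv[2])} files")
```

Because the script only depends on ordinary file reads and writes, the same code can run on a laptop against a sample directory and inside a container against whatever directory the platform provides.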

You can easily throw more memory at it and stuff like that. And so now you have sort of the first step of a productionized pipeline. Now, the next step is figuring out which of these steps it makes sense to tease apart so that maybe their outputs can be used by other steps. You know, in the future, you might want to do the same pre-processing and then train multiple different models and then do the same post-processing on them or something like that.

And so I would separate out the pre-processing step, the training step, and the post-processing step into their own individual pipelines. And so now I've got a chain of like three steps and each of these is doing something different. And now I get the opportunity to sort of optimize each of these steps individually, right? So the pre-processing step, for the most part, the pre-processing steps that I've seen can be done completely in parallel, right?

You're doing things like cleaning up the images. You don't need to see all of the other images to clean up one image. Parallel as far as like in a sense of distributed processing, like processing things in isolation. Exactly, exactly.

So that's another of the like sort of important things that we get from a container is that it's very, very easy for us to scale that up, right? So you can say, I need to process all these images. Here's a container that does it, but don't just spin up one copy of this container. Give me a thousand.

And so you're now cranking through a thousand images at the same time rather than one. And so you'll get done much, much faster and you can handle much, much bigger loads. So I would do that with that step. The training step, making training happen in parallel is definitely a much more complicated question than making something like pre-processing happen in parallel.
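The reason pre-processing scales out so easily is that each image can be cleaned without looking at any other image. The sketch below shows that property on a single machine using a thread pool; it is only an analogy for running many copies of a container, and the clean function and its behavior are placeholders invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(image: bytes) -> bytes:
    """Placeholder per-image cleanup; it needs no other image to do its work."""
    return image.strip().lower()


def clean_all(images: dict, workers: int = 4) -> dict:
    """Clean every image independently, using up to `workers` at a time."""
    names = sorted(images)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with names.
        cleaned = pool.map(clean, (images[name] for name in names))
        return dict(zip(names, cleaned))
```

Since no worker depends on another's output, raising the worker count changes only how fast the job finishes, never the result. Training usually lacks this property, which is why it is treated differently in the discussion that follows.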

So normally we would still keep that as a non-parallel thing because your code needs to see all the data to train on it. If that is not true, if you really want to start parallelizing that, that is when you want to start looking at things like Kubeflow, which we integrate with, as you know, Dan. Although we're still working on making that integration better. And then the last step, the post-processing step, that one could sort of stay as is unless you were anticipating having a lot of things that you wanted to post-process in parallel.

So for example, when the DOD did their pipelines, they're just all designed around the fact that we have one data set, but we have, you know, thousands of different people submitting models that they want to get tested. And so actually the post-processing step could be pretty expensive because they were just doing it for so many different entries. And so that was happening in parallel as well. From an infrastructure perspective, that's basically the idea of these pipelines is that when you segment these steps off into little pipelines, you then get complete control over the infrastructure on a pipeline-by-pipeline basis.

So you get the ability to say, like, this one needs to run in parallel with a thousand copies of the container up, and each of those containers needs to have a GPU accessible to it, and this much memory and stuff like that. And this one over here is not really doing much at all, so, like, it just needs one container and we'll fit that in somewhere. And the system sort of automatically figures out how to make all of this work with the resources that it has. Okay, hey, JD, that was a great explanation.
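One way to picture per-pipeline control is as a small declarative record for each step. The sketch below is a hypothetical data structure, not Pachyderm's pipeline specification format: the class name, field names, and the numbers are all made up for illustration. It captures the three-step chain described above, where each step names its input and states its own parallelism, GPU, and memory needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineSpec:
    """Hypothetical per-pipeline resource declaration."""
    name: str
    input_repo: str
    parallelism: int = 1
    gpus_per_worker: int = 0
    memory_gb_per_worker: float = 1.0

    def total_gpus(self) -> int:
        return self.parallelism * self.gpus_per_worker

    def total_memory_gb(self) -> float:
        return self.parallelism * self.memory_gb_per_worker


def fits(specs, cluster_gpus: int, cluster_memory_gb: float) -> bool:
    """Naive check: could every worker of every pipeline run at once?"""
    return (sum(s.total_gpus() for s in specs) <= cluster_gpus
            and sum(s.total_memory_gb() for s in specs) <= cluster_memory_gb)


# Illustrative chain: each step reads the previous step's output.
PIPELINES = [
    PipelineSpec("preprocess", input_repo="images", parallelism=1000,
                 gpus_per_worker=1, memory_gb_per_worker=4.0),
    PipelineSpec("train", input_repo="preprocess",
                 gpus_per_worker=1, memory_gb_per_worker=16.0),
    PipelineSpec("postprocess", input_repo="train"),
]
```

The fits check is deliberately simplistic. A real scheduler places workers over time as resources free up rather than requiring everything to fit at once, but the sketch shows why declaring requirements per pipeline gives the system enough information to do that placement.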

As a beginner, I have a few questions I'd like to follow up with. First of all, you mentioned Kubeflow, so I take it that Kubernetes is part of the architecture that you're deploying onto? Yes, I guess I jumped the gun on that one, mentioning Kubeflow before Kubernetes, but yes, this is, I think, when we need to bring in one more jargony word, and this will probably be our last infrastructure jargony word, which is Kubernetes. If you've heard of Docker, you've probably heard of Kubernetes as well.

Actually, at this point, I think if you install Docker, it just has Kubernetes built into it. You should think of Kubernetes as kind of the puppet master for your containers, right? So a container is a really, really good way to deploy a single piece of code, like a program. It's literally just a process inside the box.

But for more complicated distributed applications, you need to deploy a bunch of programs on different machines and make sure that they can all talk to each other and that they have the right resources and everything like that. And that's the piece that Kubernetes handles. So Kubernetes allows you to speak in very high-level terms. They're a lot of the terms I was using when talking about Pachyderm: basically being able to say, I want you to make sure that there is a copy of this container running somewhere. You have a thousand machines, you have the code to run, just make sure that this is always up somewhere and I can talk to it consistently when I hit, like, this IP address or something like that. And Kubernetes will figure all that out in the background for you. And, you know, instead of one copy it can be a thousand copies, and they can have specific infrastructure requirements like GPUs and stuff like that. Kubernetes just solves all of that and deploys all these containers. And so that's how we accomplish that with Pachyderm: we basically just take these Kubernetes semantics and then augment them with knowledge of the data that needs to be processed, and capture how that data gets processed and where it goes.

Gotcha. So just to kind of catch up a little bit and make sure I'm on the right track: you have Kubernetes deployed for infrastructure and you're deploying Pachyderm on top of that, and you have the file system that brings the versioning and your capability for provenance tracking. And you talk about the pipelines and stuff. Just to ensure that I'm on the right track, I assume that the data is in the containers that you're deploying specifically?

Yeah, so that's where it starts to get interesting. The data is in the containers, but it's kind of ephemerally in the containers, because containers themselves are kind of ephemeral. Part of the point of a system like Kubernetes, and the reason that you give it, you know, a thousand nodes to operate on, is that any of those nodes could
die at any time, right? And this is the sort of thing that's technically always true. You know, even when you're just running your code on your laptop, your laptop can die at any time; it's a physical machine. It isn't such a concern when you have one computer, but when you're running on a thousand, it's almost guaranteed to happen once a day, just because you've got so many machines there. And so we put the data into your container for you to process, and then when you finish processing it, we write it back out to object storage. Once it's in object storage, that's when it's actually persisted within our architecture, because anything that's stored on a disk in a container could disappear at any moment; that is basically how we operate.

This is also a great opportunity for me to talk to you about what the actual interface is that your code gets to the Pachyderm data. We really, really wanted to build a system that was going to be language agnostic. One of the things that really bugged me about the Hadoop ecosystem was that you sort of had to write in Java to really get the most comfortable semantics. Like, you could kind of use Python, but it was always a little bit kludgy. And so when the code that you put in a container boots up because Pachyderm wants it to process some data, you will just find your data sitting on the local file system under a directory called /pfs. And these are just totally normal files. You can open them with, you know, the system call open, and you can read from them and write to them and stuff like that. And so this, we thought, was just the most natural interface that your code could possibly have. When users have just written, you know, a Jupyter notebook to process some stuff on their laptop, normally they're getting that data from local disk too. And so they often have the experience, when they're getting onto Pachyderm, of thinking, okay, I'm going to need to learn Pachyderm, maybe I'm going to need
to, like, import Pachyderm into my Python code or something like that. Like, no, you can just use your normal OS system calls to open data and write data out, and that's the entire system. That's all you need to do.

Yeah, so I have a thought there, and maybe there have been some updates that I'm not aware of. But I think one of the common struggles that I've seen people ask about is that, you know, this is definitely fundamentally different than something like Hadoop or Spark, where you have some concept of data locality. Here you're kind of putting data into the container and then taking it out, but it actually lives somewhere else. Are there concerns with that? What are the sort of trade-offs that you're playing with there, especially as you get into larger data sets and that sort of thing?

Yeah, so there's absolutely trade-offs, right? Because each time, that means that the data needs to be downloaded from S3 and written to local disk, which is normally faster than S3, so that doesn't really incur a penalty, and then it needs to be pushed back into S3. And so basically what you're trading off here is that this system could be more performant if it was entirely using hard drives, but it would be harder for admins to maintain, right? Because the thing that people like about object storage is that it's just really dumb and simple. You've just got a bucket sitting there with all the data in it. There's no, like, which hard drive is this on, do we have all the hard drives, are they linked up to the right things, and stuff like that. The reason that we chose this as our initial architecture is that we saw people basically making the same trade-off in Hadoop, even though they didn't have to. So by far the most common Hadoop cluster that we see today, and this applies to Spark as well, is basically everything stored in object storage,
almost always S3, and then MapReduce on top of that. And a lot of people are just bypassing actual HDFS at this point. We have been making, over the last release, and we're going to do a lot more of this in the upcoming release, a lot of progress toward using hard drives to cache stuff. And so we're sort of going the other way. They were first a hard-drive-only solution, and then they started having S3 as a way to checkpoint stuff out to long-term storage, and then eventually that started becoming the only way that people ran stuff. We're always going to have object storage as the long-term place that we checkpoint stuff out to, and then we're going to use hard drives on top as a cache. And that will also allow us to use boatloads of memory as a cache too, similar to Tachyon, if people want really, really low latency stuff.

Cool. Yeah, the times that I've interacted with Spark, I kind of always defaulted to that S3 option anyway, because it was hard for me to figure out other things. I don't know if that's just my own ignorance or whatever it is, but I definitely hear you on that front. But yeah, there's always trade-offs, right? You don't get anything for free; it's really about what you want to optimize for.

Yep, that's always true. And actually, one of the things that we do a lot of is trying to counsel people to not worry as much about performance on the margins in the early days, because we've seen a lot of infrastructure deployments and data science projects that just get really bogged down thinking, well, there's going to be this extra cost of data getting copied from S3 and getting back and stuff like that. And we always try to tell people: worry about these things if it's truly going to make it impossible for you to accomplish your goals, like this absolutely needs to be a low-latency system because you're doing, like, algorithmic trading or
something like that. But in a lot of cases we feel like people get better results by just focusing on getting something that works. And that's, you know, I think exactly the trade-off that you were making when you were setting up Spark: yeah, if you really bang your head against the wall, you can figure out how to set up HDFS on solid state drives on AWS, and it's going to be faster than what you're doing with S3. But if you count the amount of time that you spent setting that up as part of the time until you actually get your results, you might actually get them much, much slower. So there's a huge amount of value in having an infrastructure that you understand top to bottom and that is simple.

So I wanted to ask about that. We kind of talked about a lot of different technologies, you know, in these potential use cases. And getting back to teams and individual skills, I've led teams where the skills varied fairly widely. Some people, like myself, came from software engineering into the AI world, the machine learning world, and others came straight out of school with data science degrees and had not done some of those things. Do you ever find that there is any challenge or intimidation, where people come out and they may know their data science, but they may not have even heard of Kubernetes or be familiar with containerization? I kind of want to call that out because, you know, me and you and Daniel are all incredibly familiar with containerization, Kubernetes and such, but not everybody is. How do you speak to that? Do you recommend that a data engineer or infrastructure engineer get involved, or what have you run into in real life?

Yeah, so that's definitely a challenge for us, and we really see the full gamut, and it's just very, very interesting. You see some people who bill themselves as, like, look, I'm a data science person, I've never really done serious software
engineering, I don't really keep up on this stuff. And then you just sit them down and say, "All right, well, here's what Docker is, here's how you install it," and they say, "Oh, this basically seems to make sense. I can get by here." And then there are some people for whom we do education sessions and basically just try to teach them the basics of containers so that they can work with it.

I would say that when we really have challenges, it's less about software engineering expertise and probably more about DevOps expertise, to be honest. A lot of the types of issues that we hit are just that the permissioning on the Kubernetes cluster is wrong, so when you go to deploy your code, everything works until it starts trying to talk to S3, and then the network just doesn't work because the bucket is rejecting it, or something like that. There's just a lot of DevOps complication in there.

So we always try to keep our feet on the ground a little bit on this stuff, because our whole goal with Pachyderm came from when I was at Airbnb. I was like, "Well, this data infrastructure is really hard, and my team is 25 people just keeping this darn thing running. So what are all the companies that don't have a team of 25 people to keep their data infrastructure running doing?" We wanted to make something where you didn't need that team, where a data scientist could just do it by themselves. I think we're closer, but then when we go into companies and talk to them, they say, "Well, we've got one person working on this full time," and they're feeling like they have to do a lot of DevOps to keep the Pachyderm cluster up and running. I realize, okay, we've made an improvement here, but we haven't just magically eliminated this. We haven't gone from needing 25 DevOps people to keep big infrastructure running to needing zero DevOps people to do
it. So we're trying to make that better at every release; we're trying to make that as easy as possible. One of the big steps for that will be having our own hosted solution, so people don't have to deploy everything on their own cloud just to try it out. The short answer is that it's definitely a challenge: there's a bit of an infrastructure leap that needs to be made, which can be uncomfortable for a lot of people who could ultimately benefit from the feature set of Pachyderm. They just can't get past the activation energy.

So I was wondering, another question you commonly find is that people have existing infrastructure in place. They might be a Hadoop shop, a Spark shop, or one of several other technologies; they might have databases like Cassandra. What are you trying to replace, and how are you trying to fit in? I know we talked about the data locality issue, but are there any other big considerations for why you should go with Pachyderm versus what they already have in house?

Yeah, I would say the things we're trying to replace are HDFS and then the computation layers on top of that. MapReduce is a common one, but Hive and Spark and stuff like that we're also trying to speak to. Those are the main things that we're trying to replace. We constantly have the challenge of people who have existing data infrastructure and want us to fit into that well, and that's always a bit of a back and forth. Some things can work really well in Pachyderm, because you have the flexibility of a container, so you can put whatever you want in there. People will have containers that include code so they can go and talk to HBase somewhere else in the cluster. So you have a natural shim to put between your existing infrastructure and Pachyderm, which is the container code, which is totally flexible. It doesn't work beautifully for everything, right?
What you wind up doing with something like Spark is: here's your data, it's stored in Pachyderm; now you have a job and you want to talk to Spark, so now you need to push all this data into Spark, or somewhere where it can access it, or something like that. So we're constantly trying to figure out how to make these integrations better. But the users that always excite us the most are the people who basically come in and say, "We don't want to go down the Hadoop route. We know that there is a lot of pain required to get a working Hadoop cluster and to get stuff functional on it, so we want to try something different and just build on Pachyderm from scratch." Long term, for our company, we're focused on how we can make things really good for people who see the Pachyderm vision and commit to it from scratch, because if we're successful in 10 years, those are going to be the people who have really made the company successful. The integrations will help us along the way to onboard more people, but it's really going to depend on that core use case.

Yeah, so on the team that I'm working on now, the organization is pretty big, but on this project it's myself, who has some type of data science background, and then another guy who is somewhat technical, but he's a linguist. So the probability of us being able to spin up a working Hadoop infrastructure is probably less than zero percent. If there's one thing I could say to listeners, it's that even if you just get to where you can use containers themselves, it's a huge benefit, including for reproducibility in the space of machine learning and AI, which is awesome.

So I wanted to follow up. You've already mentioned, JD, that Pachyderm, at least what we've talked about up to this point, is free. But I also know you're a company, right? And I should give you some
congratulations, because you just hit a big accomplishment, isn't that right?

Yeah, and thank you for the congratulations. We just raised a Series A, which means that we have a ton more funding to pursue our vision for data science infrastructure. It also means that you can commit to Pachyderm infrastructure with a lot more peace of mind now, because you know the company is going to be around for quite a while to come. As you said, we are a company, which means that we need a way to make money, and for that we have an enterprise product. So let me tell you what's in it that you won't find in the open source version. We try to make it so that our open source product contains everything that's going to be really useful to individuals and people who just want to get some data science done but aren't running within a gigantic organization where they have all those concerns.

The types of things that go into the enterprise product are the permissioning system: the ability to say, this data right here is owned by Dan, this data right here is owned by JD, this data right here is owned by Steve, things like that, and make sure that nobody is getting data that they don't have access to. What's cool, and what we think is a very crucial feature for these types of systems, is that it's informed by our provenance model. A big problem that you'll run into in big data organizations is that it's very easy to have some data that nobody's allowed to see that then gets turned into a model, or some sort of aggregation, that everyone's allowed to see and that is accidentally leaking the data that went into it. So our provenance tracking system informs the permission system: if you don't have access to the provenance of data, then by default you don't have access to the data itself, because it might contain that information that you're
not allowed to see.

Other things that go into the enterprise product are a wizard-style UI builder for building new pipelines and visualizing how they're working, and the ability to track and really optimize your pipelines: see where they're spending all their time and squeeze every last bit of performance out of your hardware. The other main thing that we sell is basically just support and our time, the ability to talk to us and have us prioritize features and stuff like that, which every open source project does.

Yeah, it's really interesting. I always love to hear different people's perspectives on their open source models as well. I was just talking the other day to a friend who's starting a new business and considering how they should approach open source yet also be a company and survive. So I think there are definitely people out there who are interested in that question, and I appreciate you sharing that.

Yeah, absolutely. It's tricky, and it's very imperfect, because I really think that this is a system that should exist. There's a lot of need for a system like this, and it basically has to be open source for it to actually fill that need. In my mind, I just wouldn't see a proprietary system becoming the standard infrastructure layer. But it's very, very hard to get the funding to work when you're open source. It's this huge asset, because people can so easily try your product and you get so much adoption, but it really anchors people with an unwillingness to pay for software when it's open source. So you always need to cross that threshold. One of the things that we're looking to do in the future, now that we've raised more money, is build the hosted version of our software, because that totally changes the value proposition. But it also, I think, has some sort
of psychological effect on people, wherein nobody would ever pay for Git, but the idea that you're going to pay seven bucks a month to have private repos on GitHub or something like that is totally palatable to people.

I think that's a fantastic idea. I love the hosted idea. I know that when Daniel first introduced me to Pachyderm a while back and I was initially learning about it, coming from the software engineering world, the fact that it was built on containerization and Kubernetes was a huge plus for me. If I recall correctly, a lot of it is in Go, which I thought was pretty amazing, as are Docker and Kubernetes. If you're just hearing about it, and you come away from this episode today wanting to learn more, dive in, get your hands dirty, and figure out if it's right for your organization, how do people get started with that?

Yeah, so we've got a bunch of tutorials and quick start guides online, so if you want to sit down with a guide and start hacking away, that's the way to do it. We also have a very active user Slack channel where all of our engineers and everyone on the team are always hanging out and ready to answer questions. Those questions range from "I hit this error, what do I do?", where we just give you a simple response (hopefully it's simple), to people asking us, "I'm looking at Pachyderm for a new project. Talk to me about the feature set. Talk to me about how you think this could be helpful here." So I think that's really the best way: if you want someone to talk to about stuff, just stop by the Slack channel.

Awesome. Well, thank you so much for taking time to talk with us, JD. Of course, we'll put the links to the tutorials, the docs, the Slack channel, and all that in our show notes, so go check those out. It's been awesome to hear from you
and really excited to hear about the progress with Pachyderm and all the good things you're doing.

Yeah, thanks so much for having me, man. I love hearing the podcast.

All right. Well, we look forward to seeing great things from Pachyderm. Thanks again.

Thanks for coming on the show.

All right, thank you for tuning into this episode of Practical AI. If you enjoyed the show, do us a favor: go on iTunes and give us a rating, or go to your podcast app and favorite it. If you are on Twitter or another social network, share a link with your friends. Whatever you've got to do, share the show with friends if you enjoyed it. And bandwidth for Changelog is provided by Fastly. Learn more at Fastly.com.

And we catch our errors before our users do here at Changelog because of Rollbar. Check them out at rollbar.com/changelog. And we're hosted on Linode cloud servers. Head to linode.com/changelog.

Check them out. Support the show. This episode is hosted by Daniel Whitenack and Chris Benson. Editing is done by Tim Smith.

The music is by Breakmaster Cylinder. And you can find more shows just like this at Changelog.com. When you go there, pop in your email address and get our weekly email, keeping you up to date with the news and podcasts for developers, in your inbox every single week.

Frequently Asked Questions

How long is this episode of Changelog Master Feed?

This episode is 41 minutes long.

When was this Changelog Master Feed episode published?

This episode was published on December 3, 2018.
