You know, if I could go back, that is the one thing I would change about. If I don't go off into a Docker container, for pure ease of learning and ease of training, nothing beats Colab, in my view. Yeah. They have the best simplified interface that has everything that you need there.
So, yes, given that option, I will often use Colab to go do that. Bandwidth for Changelog is provided by Fastly. Learn more at Fastly.com. We move fast and fix things here at Changelog because of Rollbar.
Check them out at Rollbar.com. And we're hosted on Linode cloud servers at linode.com slash Changelog. Linode is our cloud server of choice. Grab the Nano plan for just $5 a month, just $5.
That gets you a gig of RAM, a blazing fast 25 gig SSD, and one terabyte of transfer. Let's be honest, you can go a long way on that $5. When you do need to scale up, their prices are predictable, so you can put your calculator down. You won't need it.
We've been running Changelog.com on Linode for years, and we've always been impressed by their award-winning support team. Check them out at Linode.com slash Changelog. Once again, that's Linode.com slash Changelog. Welcome to Practical AI, a weekly podcast that makes artificial intelligence practical, productive, and accessible to everyone.
This is where conversations around AI, machine learning, and data science happen. Join the community and Slack with us around various topics of the show at Changelog.com slash community, and follow us on Twitter. We're at PracticalAIFM. Welcome to another Fully Connected episode of Practical AI, where Daniel and I keep you fully connected with everything that's happening in the AI community. We're going to take some time to discuss the latest AI news and dig into some learning resources to help you level up your machine learning game.
My name is Chris Benson. I'm a principal AI strategist at Lockheed Martin, and with me as always is Daniel Whitenack, who is a data scientist at SIL International. Hey, how's it going today, Daniel? It's going very good.
It's hot here, but I guess I'm in the Midwest of the United States, and you are in the South. I'm sure what I'm experiencing is nothing compared to Georgia heat. I think that the thing that really gets you in Georgia is the humidity. Yeah.
So it tends to be very humid here, and that's what usually gets people. It's not terribly hot today. Yeah. I think we're around 80 degrees Fahrenheit, so it's not too bad.
It's quite humid outside, though. Oh, wow. I think it's warmer here, then. It might be.
I think we're pushing 90 Fahrenheit. I think that's where we're going to be tomorrow. I think we're popping back up with you tomorrow, but we have some weather that's come through and cooled things off. But that's the way it is.
It's summertime, man. Yeah. I've also got a few boxes sitting right next to me. I finally decided to build a computer.
I know we talked about this a couple times. I have fond memories of building computers earlier on in my life when I was in grad school and then before that in college and other times, but ever since being a data scientist, I've always just had a laptop, you know? Right. I don't know.
It's definitely not necessary in any sort of way to have your own personal, you know, AI machine. But since I talk about this stuff so much, it's almost like a rite of passage, I guess. I feel like I should have that experience. So I don't know.
We'll see. I've got the boxes. I've got the case, RAM, hard drive, and GPU. There you go.
But the rest is on its way and currently not functional. So for listeners, though, I know you can't see this, but Daniel and I are talking over Zoom, and he has the data center background up behind him. The Zoom virtual background. There you go.
You're building out the DGX rack there behind you. I'm not quite in the server room yet, although I do expect the room I'm in will warm up quite a bit when I turn my computer on whenever it's going. But yeah, I'll have to make sure and have the fan going, and maybe, you know, eventually it will be cold here in Indiana, so maybe that will be a benefit. I don't know.
I'm just glad to know you're not living in the server room there. You know, sheltering in place in the data center. I'm bringing the server to me. That's right.
I'm just sitting on my dining room table. Not even in my office, because we have a family member staying in my office as a bedroom right now during the quarantine. Gotcha. Yeah, it should be fun.
So one more person in our house and one more computer, slash portable heater, coming soon. Gotcha. Yeah, you know, actually you and I are doing something a little bit similar there, in that today I'm going to buy three Amazon instances to build a Kubernetes cluster for the animal protection charity that I work on. So I've got to get the new Kubernetes cluster up and running.
Yeah, that's exciting. It will always be an interesting experience. Are you going to sort of manage your own deployments sort of thing, or are you doing the EKS managed by Amazon type thing? So it's not my budget.
It's a charity's budget, and it's a very small charity with a very tight budget. So I'm simply buying the instances at rock-bottom reserved prices and then doing everything myself to try to keep that as low as possible. Look into the tool kops. That's exactly what I'm using.
Yeah, yeah. I've used that a good bit in the past, and it saves a lot of trouble. That's what I'm going with. Exactly.
Yeah, and kops itself is an open source tool that's out there. Chris is working with a charity, and I also work with a non-profit, so open source things are a lot of times very nice to use. They are. And I thought today maybe we could discuss that a bit, but more in the context of AI, I guess. So open source and AI: contributing to AI open source, open data.
There's a lot of different related things there, generally under open source AI or open AI things, not to be confused with OpenAI, the company. There you go. Yep. Yeah, I don't know.
Does that sound interesting? No, that sounds really good. Ironically, our talk of building things and Kubernetes clusters leads kind of right into it, because modern AI tooling is largely built on Docker and Kubernetes these days and such, so that's perfect timing on that. Yeah, so maybe we'll just start by, I guess, talking about that.
On the show, we mention certain things very often, and I could think of many of those off the top of my head that are all open source, because I think that's probably the more standard case. So like TensorFlow, PyTorch, Docker, spaCy, Kubernetes itself, what else? A lot of these things are all open source, right? They are.
I mean, the software of artificial intelligence is largely built on open source, and people end up paying for hardware or services for hardware. Yeah. You know, that's kind of how the divvy is, you know, you budget for the hardware or the services to gain access to compute. Yeah, and maybe we should clarify as well what we mean by open source.
Maybe people are familiar with that term or maybe not, and there are some confusing things around it. Actually, maybe one of the confusing things is that open source doesn't necessarily mean free either. So I guess my background isn't in computer science or software engineering, so I'll probably have some computer science people get mad at me. There's like a proper definition.
But I mean, open source, I think mainly its etymology derives from the fact that you can see all the code that is there. The code is available for you to obtain and/or modify. So with TensorFlow, for example, you can go to github.com slash TensorFlow slash TensorFlow. I think that's still the link. Yeah, I think it's really defined by the fact that when you distribute open source software, you have to distribute the source code of the program or programs themselves with your distribution.
So you don't just get the executable you're running, right? Just a binary. You don't get just a binary. You get the source code along with that.
And typically, and I'm saying this strictly from personal experience, the vast majority of open source software, I would argue, is freely available for people to use. And then the way it ends up is a lot of times that licensing allows companies to integrate open source software into their own proprietary packages. And they do have to distribute the source code for that part of it. But they may also have proprietary code depending on the licensing available as well or as a service.
Yeah, so it should be said, too, that if you go to TensorFlow slash TensorFlow on GitHub, you can see all the code that makes up TensorFlow, at least the core part of TensorFlow there. Then there is an enterprise version of TensorFlow that Google came out with recently. And some of those elements may not be open, right? Some of them might be open.
I'm not sure. But you'll see this pattern a lot, too, where you have, I think they call it open core, where you have a core part of a tool or software that is open and you can use. And then there might be a set of additional functionalities or maybe even like an upgraded version that you have to pay for that has some extra features or maybe more robustness or maybe it supports multiple users or maybe it supports specific access controls or other things that are more enterprise-y, I guess. So that's another pattern that you'll see.
But at TensorFlow slash TensorFlow, you'll see the code if you go to that link. And then, a little ways down in the file listing, there's a license. And you'll see that this license is an Apache 2.0 license, a very common open source license. It's a very permissive one.
Yeah, very permissive. So it allows you to do a lot of things with the code. But there's a bunch of other licenses as well out there. There's the MIT license.
Which is also permissive. Yeah, yeah. So actually, there's probably a guide out there that details all the different ones. And I think on GitHub, when you're choosing a license for a project, they actually have a way to compare them.
But yeah, some of these are more permissive than others and allow you to do certain things with the code that you can't do with other projects that have a different license or something like that. Right. So that's one thing to be aware of.
But I think a lot of where people might get hung up with that is if you suck that code into your own project and it's part of your suite of software and then you sell that or something, there might be certain implications to that. But in general, a lot of people might use TensorFlow, for example, to train a model. And then that model is what they ship with their product or something like that. Right.
And that model, which may not be proprietary in any way. I mean, it is proprietary to that company. Yeah, to that company. And also, the code that is doing inference for that model, and even the training code for it, is using TensorFlow, but you're not just copying TensorFlow itself and selling TensorFlow.
You are using TensorFlow to create custom code in the same way you'd import other libraries and that sort of thing. So there is a whole world of thought around open source software and what licenses are good and not good. And actually, certain companies have restrictions around open source projects: they might allow you to use code that has a certain license versus code that has a different license. That might be something you want to be aware of with your own company as well.
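As a rough illustration of the company license policy just described, here is a minimal Python sketch. The allowlist, the `check_dependencies` helper, and the second package name are all hypothetical and are not taken from any real company's policy; only TensorFlow's Apache 2.0 license comes from the conversation. It is not legal advice.

```python
# Hypothetical allowlist: an assumption for illustration, not a real policy.
ALLOWED_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}


def check_dependencies(dependencies):
    """Split a {package: license} mapping into approved and needs-review lists."""
    approved, needs_review = [], []
    for package, license_id in sorted(dependencies.items()):
        if license_id in ALLOWED_LICENSES:
            approved.append(package)
        else:
            needs_review.append(package)
    return approved, needs_review


deps = {
    "tensorflow": "Apache-2.0",       # license mentioned in the episode
    "some-internal-fork": "GPL-3.0",  # hypothetical package and license
}
approved, needs_review = check_dependencies(deps)
print("approved:", approved)
print("needs review:", needs_review)
```

In practice the license for each dependency would have to come from somewhere trustworthy, such as the project's own license file, rather than a hand-typed dictionary like this one.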
Absolutely. And depending on policy, they may orient on the license in terms of approvals, or they may focus on the specific software itself along with its license. But all this is really relevant now to AI. And I think a lot of people that have come into AI from routes other than software, particularly open source software, are having to learn this as they go along, which I thought was one of the great reasons that we should talk about this today when you suggested it. As we see the field of AI maturing and evolving very rapidly, it is becoming integrated into what is essentially a software stack that different organizations have, and into their workflows, and it is how they productively enable some of their software.
And so it's really being wrapped into the software lifecycle itself. And so it's now affecting people as we talk about this. As software developers, we might be talking about how do we contribute to open source code and open source projects, but now it makes a lot of sense to talk about how do we contribute to open source artificial intelligence and open data. Yeah.
I mean, as opposed to sort of normal software engineering workflows, data really drives how code operates in the world of AI. So how you get the data and distribute data associated with your AI project is very relevant. Before we move on, I'll just mention too, there's episode 322 from our friends at the Changelog podcast. They talked to Manish from Dgraph about licensing and relicensing and all those sorts of things they did with that.
I found that episode very enlightening on these topics. So if you're interested, they dive a lot deeper into that. I had no idea you were going to mention Dgraph, but that's my current hot topic for myself. So Dgraph's awesome.
I'm moving into Dgraph right now for what I'm doing. Side note, yeah, Dgraph is a graph database and it's really, really great. Actually, the query language that you use on top of it is GraphQL, which makes it really nice in a lot of ways, and it's very performant. And yeah, anyway, if you're interested in graph databases, check it out.
I'll give a shameless plug because I do like that project, which is another open source project. It is. It's open source and it's a project that I'm integrating into my AI workflow at this point for the charity that I spoke about. Oh, awesome.
Yeah, that's definitely so cool since you mentioned it. Definitely. I love plugging the projects that we both use here on the podcast, especially if they're open source and there's a community around it like Dgraph. But I guess that's one thing.
Dgraph is a database and so you're using it in your AI project, but it's not AI software necessarily, but it's the data store associated with the AI software. Right. And that's how I'm using it. These sort of auxiliary or I don't know what we call it, supplemental infrastructure things are often driven by open source as well, right?
Right. I mean, to describe it, if people are wondering how that fits in, and it's not really specific to what I'm doing, this could apply to a whole lot of different possibilities: if you are operating an organization, you have operational data, things that you're doing with whatever your organization is, and you need a data store to keep that, but you may also want to provide analytics on that. You may want to apply some AI modeling to some of that data. And so it really all comes down to the fact that you are integrating AI into your software workflow.
That's a good sign. That's a sign of maturity. We deserve a better internet and the Brave team has the recipe for bringing it to us. Start with Google Chrome.
Keep the extensions, the dev tools, and the rendering engine that make Chrome great. Rip out the Google bits. We don't need them. Mix in ad and tracker blocking by default.
Quick access to the Tor network for true private browsing, and an opt-in reward system so you can get paid to view privacy-respecting ads. Then turn around and use those rewards to support your favorite web creators, like us. Download Brave today using the link in the show notes and give it a try on changelog.com. So we were just getting into the topic, and you mentioned something that was really important about AI not just being about code, but being about data.
And I think along with that data, there's a certain piece of data, which is the model itself, which is really just another piece of data. So there's the code piece, but then there's the data piece. And oftentimes there's this weirdness, because code is open-sourced on GitHub, and to me it seems like there's this very structured sort of way to go about finding open source code and things.
And then open data is just sort of like all over the place. It's like totally scattered and weird. And like, I don't know if you have a similar experience. No, I do.
I think that there's been a lot of great work trying to address that problem recently. And we'll talk about some of those, you know, as we go forward in terms of how to find data. But yeah, I think the fortunate side is a lot of people that are already working in open source software are recognizing that they need data and they have the same problem. And so it's getting tackled fairly quickly.
Yeah. Maybe before we talk about how to find data, I guess there also could be licenses associated with data, right? Sure. You aren't able to repost it in another place.
You aren't able to use it for these purposes. You aren't able to do certain things with the data. Recently, I downloaded some audio data from Mozilla's Common Voice project. So their workflow, like you find the data set you want, you put in your email to download it.
And when you do that, you also have to agree. I think the agreement was that you wouldn't try to identify, like personally identify the people whose voices are represented in the data. Yeah. So like that stipulation is very specific to that data set.
But I guess it is kind of common in the sense that there's a lot of data sets that you could potentially try to identify people within data sets, which is an issue. It's an interesting juxtaposition of kind of licensing plus responsible AI, you know, ensuring that things, principles like protecting PII, personally identifiable information are all integrated in. So I find that interesting that they did it that way. Yeah.
Yeah. And I guess as well, you know, models being another piece of data. So just as a reminder for people, like when AI people refer to a model, they're basically just referring to a representation of a network architecture usually. So like this number gets fed into this operation and then gets fed into this, et cetera, along with the parameters associated with those operations, which are called weights and biases.
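To make the "a model is just data" idea concrete, here is a minimal Python sketch. The tiny one-layer `model` dictionary, its numbers, and the `predict` helper are invented for illustration and do not correspond to any real framework's file format; real toolkits store far larger models in their own formats.

```python
import json

# A hypothetical model: one linear operation with two inputs and one output.
# The architecture description plus the weights and biases is all plain data.
model = {
    "architecture": [{"op": "linear", "inputs": 2, "outputs": 1}],
    "weights": [[0.5, -1.0]],
    "biases": [0.25],
}


def predict(model, x):
    """Apply the single linear layer: each output is sum(w * x) + bias."""
    outputs = []
    for row, bias in zip(model["weights"], model["biases"]):
        outputs.append(sum(w * xi for w, xi in zip(row, x)) + bias)
    return outputs


# Because the model is data, it can be written out and read back like any file.
serialized = json.dumps(model)
restored = json.loads(serialized)

# The restored copy makes the same prediction as the original.
print(predict(model, [2.0, 1.0]))
print(predict(restored, [2.0, 1.0]))
```

A pre-trained model downloaded from a link is the same idea at a much larger scale: a file of architecture and parameters that some code loads and runs.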
And all of that can be represented in data, especially if your model has, you know, 300 million parameters. You're going to put that into a data file and store it somewhere. But it is essentially a complex data structure. Yeah, it's a data set. Yeah, that is the output of your modeling. Yeah, and so: data in, software operates on it, data out. Yep.
There's a lot of pre-trained models out there. And so if I'm in a GitHub repo, and it's a repo for a project someone did to do object recognition or something, I don't know, and there's a license in their repo that's Apache 2.0 or whatever. But then in their README they say you can download a pre-trained model from this link, and the link is just an S3 bucket link to download the model. I'm not actually sure what is legally implied, if anything, about what you can do with that downloaded pre-trained model. Now, there are certain sites where maybe that's more specific in terms of what you download. But in that case, which I think is actually a very common case, the sort of "here's my GitHub repo and here's a link to my model," I don't actually know if there are legal implications to what you can or can't do with that.
Yeah, so not being an attorney, but playing one on a podcast. Yeah. I would say that that data was still distributed, and it was distributed under a legal condition, probably represented by a license. And so even if that license is short-cut, meaning it's not included in the link because they didn't download the whole repo or something, then I would expect that that data would still fall under whatever license it was distributed under. Yeah, it's a good question. So if there's anyone that knows more about this out there, I would be curious. So hit us up on Slack or LinkedIn or Twitter and let us know. Maybe our friends at Immuta, who we had early on in the podcast. That's true. I don't know if you saw, but they got a whole bunch of funding. Was it like 40 million
or something? It was substantial. I don't remember what. Yeah, it was, I think, Intel Capital or something. So Immuta we had on early in the podcast. Not only do they have a data product, which is very interesting, but they're also very much legal experts in these sorts of things. So if you're listening out there, you know, let us know your thoughts.
Maybe we should turn to how to find and, you know, search for open source tools and code and data and models. What are your go-tos for that? Probably for me the same as most other people, certainly those in the software world. Obviously it's just Googling for certain terms, you know, Googling for some particular function and saying "open source" along with that, going to GitHub, going to blog entries that focus on open source ratings and distributions and such. Usually it's not hard, especially in software, because that's been going for such a long time and, you know, we kind of have our inroads there. So I can usually find something that is more or less what I want within just a moment or two of an initial search, and then it's just diving into the tool. For a while it was a lot harder on the data side to do that, but the tools there are starting to come about as well.
Yeah, it seems like in my workflow, a lot of times I'll start from known, trusted sources and then kind of branch out from that. What I mean by that is I can go to TensorFlow or PyTorch. They have extensive documentation online. So you just search for TensorFlow documentation or PyTorch documentation, or sometimes I'll search for, you know, "TensorFlow transfer learning tutorial," let's say, and there's one of those, right? So I go there. I want to do transfer learning with TensorFlow. TensorFlow is open source, so I can install it, and if I find the TensorFlow docs, they'll tell me how. Usually there's a getting started, you know, "install TensorFlow," or they'll tell you, hey, you can try it on Colab or whatever. I'm the same. Another one, kind of hitting these
what are kind of the forces, the big names in AI that are reputable, where you know that their legal teams have looked at things and all that, so there's a trust factor. Another one that I use a lot, especially at work, is NVIDIA, because they have a huge amount of documentation online. So I'll start from them and see what they have. They have a bunch of partners as well, as does Google, as does Microsoft, as does Facebook, which is PyTorch.
Yeah, so there's a lot of good documentation pages out there. And I guess in order to find those, you kind of have to have a little bit of domain knowledge of what the key tools are. I mean, you've already heard us mention a few, like TensorFlow and PyTorch, but there are other ones. Like Chris mentioned, NVIDIA has various tool sets out there. There's spaCy in the NLP world. There's of course the kind of data science-y Python toolkit, which is pandas and scikit-learn and all that stuff. I feel like we have an advantage because we know about those things. So when I'm searching, for example, to do a maybe traditional, quote-unquote, machine learning thing on a smaller data set, I might go to the scikit-learn documentation and search for how to do this thing or that. Whereas if I'm trying to do a thing that I know is an AI thing, I might search TensorFlow or PyTorch examples or tutorials on that particular thing and find certain open tutorials and how to install the right toolkit and that sort of thing. But I feel like I do have that advantage, and I'm not sure what the best way is to get that exposure to the main toolkits. I don't know if you have thoughts on that.
That's a great point, and that is that, based on whatever problem we're tackling at any point, we don't necessarily just use a single tool. There's not a single go-to thing that you're always going to use for every project. If you're a TensorFlow person, you may use a lot of TensorFlow, but you
probably also use some tools from NVIDIA, use Python tools. You know, there's a lot of different possibilities for how you might combine a toolchain together to solve a particular problem, and it may change as you go from problem to problem. So I think that domain knowledge is hard to come by. You probably either need to be really focused on self-learning and trying to follow reputable sites, or get into a course. There's a bunch of online courses, and I know we have some episodes that specifically address learning. But it helps not to start at square one when you're doing this, so that you can be a little bit more efficient, quicker.
Yeah, for sure. I agree with that. If you're trying to really level up to kind of state-of-the-art things, I would highly recommend the website Papers with Code. Yeah. You can actually search, and there's also leaderboards for common AI tasks, whether that be, you know, image recognition or visual reasoning or speech recognition, and you can search through the leaderboard of papers and then see the actual links to the tools that they use and also the code implementation. So that's a good idea. Even if you just browse around that site and look at the various things, you'll get a sense of: these are the main things that people are using to do this sort of stuff, these are the main things people are using to do that sort of stuff. So I think that could be useful.
Yeah, it is. You bring up a great point, that there are different types of things that you may be looking for. On one side, you might be looking for just raw data, and you might go to, for instance, Google's Dataset Search that they released last year, which is fantastic, because they have indexed many, many, many data sources and you can start looking. That's one of many ways to enter into it. It's not the only one. But you might also be looking for domain expertise as well. And so we've had Semantic Scholar on the show before, and
you might go look for some of these scientific papers that are relevant to the things that you're about to tackle, or you might be building on top of one of those papers. And so developing that domain expertise in the specific area, and then also having a diversity of data to tackle the problem with, is really important. I think that's a really hard thing for people that are new in AI: understanding all these different pieces you have to put together into your workflow to be productive as quickly as possible.
Yeah, for sure. It's a challenge, but I think the situation is better now than even a couple of years ago. Oh, yeah. So that's encouraging. There's a lot of tutorials out there for various things. We've had the benefit of standing on top of the software development community's shoulders, so many of these problems, had we not had that privilege, would have definitely slowed down the process. So we're seeing warp speed in the AI world in terms of its evolution, largely because we can look to other places that are associated or related and say, oh, that's already been solved for something very similar.
Changelog News is the best way to keep up with the ever-changing world of software. We track, blog, and contextualize the coolest projects, the best practices, and the biggest stories each and every week. Make changelog.com your daily destination, or hit this news button and subscribe to our weekly newsletter that hits inboxes on Sunday mornings. Join more than 15,000 enthusiastic readers. It'll cost you exactly zero dollars, and you can subscribe right now at changelog.com slash weekly.
All right, well, we've talked a good bit about the tooling and the code and everything that is out there. I'm curious, for you, let's say that you're approaching a problem and you're using a new toolkit. Maybe it's the Dgraph thing that you're talking about, you're wanting to get into that. Or, you know, for me recently, I was doing some speech-related things. There's
some new speech-related stuff out of NVIDIA that's pretty interesting. So there's this new toolkit of things that you have access to. But one of the things that I see people struggling with is that integrating that toolkit into your local machine to experiment with might not always be the easiest thing. Because, you know, this new toolkit actually requires this version of NumPy, and you have that version of NumPy, but if you change the version of NumPy on your local system, then you break these 14 other things that you use locally, right? So I'm curious how you go about that, Chris.
So, a couple of ways of entering into that answer. One is I start with the end in mind. Am I just trying to learn something new, in terms of a new skill or new workflow? If I'm doing something like that, then I might stick much closer to the tutorial specifics. I might do it in a Docker container entirely, where I can control the environment and the versions of everything. If they haven't done that for me, then as I get into the tutorial, I'll scan through it, see what they're using, and go ahead and set myself up a Docker container for that process. That way I have a constrained environment that exactly meets the tutorial's focus, and I can get through it with fewer problems. It's worth the investment of Dockerizing ahead of time if they haven't done that for you. So that's one thing.
But in general, if I'm not just doing a pure learning spike, you know, where I'm just trying to figure out how to do this thing that I care about, if I'm doing it with more of production or productivity in mind, then I think: what is the environment this has to meet? I'm maybe looking at a tutorial, but then I'll translate it into: what are the constraints that I have, what are the resources that I have available? And I'll take a little bit of time to try to transfer what they're trying to show me there into the world that I'm living in. Because at the end of
the day, if I'm not just doing a pure learning spike and I'm doing it to deploy somewhere eventually, then it needs to fit into my world. And so there's a little bit of prep time there to try to get a sweet workflow on my side going.
Yeah, for sure. One of the things that I do a lot: let's say that I'm trying to solve a problem, like a speech recognition problem or something, and I see there are three different things that people are talking about using out there, three different ways of going about it. What I might do is just spin up three different Google Colab notebooks. Yeah. Because whatever I do there, it's going to get blown away. It doesn't affect my local environment at all. But it is persisted, in the sense that the code is persisted. It is a notebook, so you sometimes have sort of weird state, and you're not guaranteed that it's going to run exactly the same way again. But it does give you a very quick way of knowing: okay, I have this environment; TensorFlow, PyTorch, pandas, spaCy, a lot of other things are already installed. There may be a couple of things I need to install via pip or something, but I can generally run through and get the flavor of how something is going to feel very quickly. And oftentimes what I'll do is spin up three different notebooks. I try to get the first thing to run in the way they're saying, and it doesn't work. Then I try a different thing and get it running in the way they say, and it's kind of annoying. Then I try a third thing, and it seems like not exactly what I want, but the workflow is kind of nice, so then I'll start adjusting from there. So even just for finding the good starting point where you want to put your flag in the ground, toolkit-wise, it can be useful to do it in that way, from what I've seen.
You know, if I could go back, that is the one thing I would change about. If I don't go off into a Docker container, for pure ease of learning and ease of
training, nothing beats Colab, in my view. Yeah. They have the best simplified interface that has everything you need there. And so yes, given that option, I will often use Colab to go do that, and I know a lot of other people besides you and me that feel the same way. Sometimes I find myself wishing that other tools would look at Colab, recognize the ease that they created for their own users, and go implement that. I'd like to see that kind of ease of use everywhere.
Yeah. Well, it would be unfair, I think, to talk about all the great things that people are putting out there in the open in terms of data and code and not talk about how you can contribute to that, or help out a project, or maybe it's your own project and you want to open it up. What are the flavors or categories of contributions that you think people getting into contributing to open source AI might have in mind, that would be useful things to think about contributing to?
So, I mean, some of the common categories of contribution would be, obviously, the code itself that is the core of that software. But code alone often is not enough. I can't count the number of times that I've tried to work with code that, without great documentation and without great examples, I found extremely hard to utilize in a productive way. And so if you don't feel that code is where you should make your contribution, consider going and figuring out how to use the tool and offering up your insights from using it as documentation, or creating examples. I love it when people create examples. If I'm coming in cold and I really don't understand the tool, a lot of times the best way for me to ramp up is to go to an example and refer to the docs from there. So those are some of the obvious things. Another thing that I would suggest to people is to reach out to the maintainer of the project and ask them, "What do you need?" A lot of times there's a Slack
team or something. Absolutely. And, you know, tell them you love what they're doing and you would like to contribute, tell them what you think you're good at contributing, and ask them for some guidance on that. And they will love you. I mean, with open source projects, there are cases where you have paid teams that are maintaining, obviously, but if you look at the vast numbers, the majority of maintainers out there are maintaining for free. They're not being paid to do that on most projects. And so they love it, and be sure to tell them how much you love the software as you do it. Help them with data: is there data that I could go out and find, or that we can make links to, whatever.
Yeah, I think that's a good point. Some of the larger projects like we've been talking about, you know, TensorFlow and others, have a large team behind them, right? But there's a lot of really great tooling out there, smaller tools that are actually pretty key in the workflow, that are developed and maintained by maybe one or two unpaid people who are doing it because they think that this thing is useful. So that's one thing to keep in mind as you're using open source software: when you're using something and it doesn't do quite the thing you want, or maybe it breaks in a certain way, the way that you go about raising that to the maintainers shouldn't be coming from a place of "Why does your tool suck so bad? You are terrible and need to do a better job." The better way to go about it is to say, "Hey, thank you so much for creating this thing. I've noticed this," and raise an issue on GitHub. You can definitely do that. But I think the even better way to go about it is to say, okay, this thing might need a slight modification here; maybe I could reach out to the maintainer and see if they would accept a contribution to add that feature in. And then you could actually create a pull request and contribute that in. It's a much more productive way of going about interacting with open
source projects.
And for those who don't know what a pull request is, it is a mechanism by which you essentially offer up your code to be integrated into a larger code base, and it gives the maintainer of that code base the chance to review what you're doing and choose to integrate it or not. And if they don't, there might be a really good reason. They'll typically give you feedback on what that is, but they're already spending their time. So I love what you said, Daniel, about don't just say, "I need a feature." Open source is democratized software to some degree. Go out there, talk to them ahead of time, and then say, "I'd like to take a stab at writing code for this," and offer it up. They can choose to take it or not, and they may give you some guidance. They're grateful for it.
Yeah, I like what you say as well. There is a contribution process that's common on GitHub, and there's a lot of jargon around that. What we'll do is include, as a learning resource on this episode, a couple of really good blog posts about this whole process, where there's a repo on GitHub that you maybe want to contribute to. And when we're saying contribute, it could be something small, right? Like if you see in a project's documentation that they have an error, and it's just a wording change, or maybe a change of a variable name, or whatever it is, it's a small thing. You of course could create an issue on GitHub and say, "Hey, you need to fix this." But it's super quick and not that hard to just go to their repo, see how they have their documentation laid out, and then it's a matter of forking, or making a copy of, that repo, pulling that down to your local machine, making the change, pushing that back up to GitHub, and then creating this thing, like Chris said, the pull request. And so that's a no-brainer. You know, you don't need to know that the contributors are going to want that change. They may reject it, but I think more than
likely they're going to be just happy that people found something wrong with their documentation and fixed it. So yeah, for that workflow we'll give a good link. If you're new to GitHub, if you're new to Git and this process of pull requests and all that, we'll put a link in so that you can learn a little bit about that.
Yeah, absolutely. Another way to contribute is, if you're using that software and it's working well for you, and you're solving something that's important to you, share that process, not just what you're doing but how you did it, how you used the software, in a blog post. That doesn't directly require interacting with the maintainer of the software, but it is showing appreciation. It is giving back to the community by showing how to use it effectively, and it informs them about not only what you've done but your workflow. All of that is really useful to other people and very community-minded.
Yeah, actually, there are a couple of really good blog posts out there about building AI workstations, like a personal computer with a GPU and all. I relied on those very, very heavily, because it's been so long since I put together a computer of my own. My reference frame was way back when stuff was named differently and processors were not near what they are now. So just getting a bearing on what range of things I need to be looking at here, and what configurations people are going after, that was really useful. So even if it's a guide like that, installing CUDA and getting your new GPU running, that sort of stuff is really useful. And of course there are probably a lot of those particular blog posts out there, but there are other things that aren't. I'm a Pachyderm user; we've had them on the show. There's this new GitHub Actions thing that people might be familiar with, where you can kind of automate tests and deployments through GitHub, and I asked in their Slack, has anyone tried to do a data pipeline from GitHub Actions,
and there were a couple of people who responded in Slack, like, "No, but I've been thinking about trying it," and that sort of thing. So if I end up doing that, I think that would be something really great. It's probably not something that they're going to pull into their main repo, maybe, because it's sort of an auxiliary thing, but it would be really nice for a blog post, so that those people out there who are trying to do that thing could find a resource and do that.
Yeah. You know, as you say that, a thought occurred to me about something that I think hasn't matured in the AI world and needs to. By comparison, if you look in the software world, not only do you have communities around specific software packages that are developed, but you also have a general sense of community around open source software that even transcends the specific language and libraries that you're in. You can go from one language to another, and there may be little changes in how the sub-communities work, but there is an understanding of what is expected in open source in general. I think that we're not there yet with AI. And I know from our conversations that we'd both love to see, instead of just having specific data sets or specific software packages, a sense of open AI, a larger sense of community being built. So whether you're using, you know, PyTorch with a particular data set, or TensorFlow, or whatever, or NVIDIA stuff, it doesn't matter. There's an overall sense, as you move through these communities, of what to expect in that AI world. And I've met so many people in AI who did not come from the software world and did not already have that built in, so we have some integration to do on that. I'd like to see that happen going forward.
Yeah, and there have been some encouraging signs on that front. I think both TensorFlow and
PyTorch have developed their various hub sort of environments, where you can share setups and models and configuration and all of that. So that's kind of nice; there's this sense that people are building these hubs. And I also think, of course, about the Hugging Face team, which now has just tons of models available in their open source project. I saw a tweet, I was just pulling it up, from Clem, from Hugging Face, who was on the show quite a while back. I will link to that episode. But his tweet was: "25 team members plus 400 open source contributors plus machine learning equals fastest technology building I've ever seen." Which I think is definitely true. You just look at the pace with which they're developing, with, I guess, what he's saying is 25 actual team members now but 400 open source contributors. There are these pockets of the community like you're talking about, and so I hope that we see that growing.
I do too. I think that's a great way to finish it. I mean, it's not just about your team. You're really standing on the shoulders of an entire community of people out there who contributed to the tools and the data available and all that. All of us are in that position. So as folks move forward, be thinking about how you can give back to this community and build that sense of community. Great conversation today, Daniel.
Yeah, for sure. That was a great idea. Thank you for coming up with it. Yeah, definitely. Hope you enjoy the hot weather, and stay safe, stay inside, and we'll talk to you soon. Will do. Take care.
Thank you for listening to this episode of Practical AI. People ask us all the time, "Hey, how can I support your work?" One easy way is to leave a five-star review on Apple Podcasts. Tell folks why you listen and why they should too. It only takes about 30 seconds, and believe it or not, those ratings and reviews really do help us rank higher in AI-related search results. Practical AI is hosted by Daniel Whitenack and Chris Benson. It's produced by
Jerod Santo, that's me, and our music is brought to you by the one and only Breakmaster Cylinder. We are sponsored by amazing people at companies who get it. Thanks again to Fastly, Linode, and Rollbar. Did you know we have a master feed of all Changelog podcasts? We do. It's your one-stop shop for everything we produce. If you like this show, you'll love The Changelog, Brain Science, and Go Time. Check it out at Changelog.com slash master, or search for Changelog Master in your favorite podcast app. You'll find us. That's it for now. We'll talk to you again next week.