EPISODE · Apr 4, 2023 · 43 MIN

Accelerated data science with a Kaggle grandmaster

from Changelog Master Feed · host Practical AI LLC

Daniel and Chris explore the intersection of Kaggle and real-world data science in this illuminating conversation with Christof Henkel, Senior Deep Learning Data Scientist at NVIDIA and Kaggle Grandmaster. Christof offers a very lucid explanation of how participation in Kaggle can positively impact a data scientist’s skill and career aspirations. He also shares some of his insights and his approach to maximizing AI productivity using GPU-accelerated tools like RAPIDS and DALI.

Sponsors:
Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com
Fly.io – The home of Changelog.com. Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.
Changelog++ – You love our content and you want to take it to the next level by showing your support. We’ll take you closer to the metal with extended episodes, make the ads disappear, and increment your audio quality with higher bitrate mp3s. Let’s do this!

Featuring:
Christof Henkel – GitHub, LinkedIn, X
Chris Benson – Website, GitHub, LinkedIn, X
Daniel Whitenack – Website, GitHub, X

Show Notes:
Christof Henkel | Kaggle
NVIDIA Kaggle Grandmasters
Kaggle
NVIDIA RAPIDS
NVIDIA Data Loading Library (DALI)

Upcoming Events: Register for upcoming webinars here!

TRANSCRIPT · AUTO-GENERATED

Welcome to Practical AI. If you work in artificial intelligence, aspire to, or are curious how AI-related technologies are changing the world, this is the show for you. Thank you to our partners at Fastly for shipping all of our pods super fast to wherever you listen. Check them out at Fastly.com. And to our friends at Fly: deploy your app servers and database close to your users, no ops required. Learn more at fly.io.

Welcome to another episode of Practical AI. This is Daniel Whitenack. I'm a data scientist with SIL International, and I'm joined as always by my co-host Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

Doing well, Daniel. How are you today?

I'm doing great. Chris, have you ever been called a Grandmaster in anything?

No, but I really wish I had, because it's a freaking cool name, man. Or title.

Aren't you like a street fighter or something? You were like a black belt or something?

Oh, yeah, don't go there. It's something like that. Thirty years ago. But yeah, once, when I was a kid. But you know what? I was never a Grandmaster at anything. I was just trying not to get pummeled. Yes, I was just trying not to hit the mat, and that's it.

Okay.

Well, today we have with us an actual Grandmaster, a Kaggle Grandmaster, Christof Henkel, who's a senior deep learning data scientist at NVIDIA and a Kaggle Grandmaster, multiple-time master, by the way.

Yeah. Yeah.

In several of the different categories. So welcome, Christof. It's great to have you here.

Welcome, Daniel. Welcome, Chris. Very happy to be here.

Awesome. Yeah.

Well, for those that aren't familiar with this concept of a Kaggle Grandmaster, could you give us a briefing on what exactly that means, also in the context of Kaggle? I think a lot of people are generally familiar with that, but just in case, what is Kaggle and what does it mean to be a Kaggle Grandmaster?

Yeah. So what is Kaggle? I would say it's like a platform for machine learning in general. It started off as a platform for hosting machine learning competitions. That's how it became popular. But in recent years, it has also expanded into being a platform for discussion and a platform for sharing notebooks, and they're hosting millions of data sets.

So they're trying to become really like the go to community for every topic around data science. And it's free to register for everyone. And they also provide some free resources where you can run code and try different stuff from competitions. And on this platform, they introduce different tiers in order to gamify a little bit, so to incentivize users to post content or to participate.

So there are four different areas in which you can reach like different levels. So they are like competitions, which is like the most famous one. There's also notebooks where you just progress by sharing notebooks with others. And the progression is based on upvotes on your notebooks.

Then there are discussions which work in the same format. So you post an answer to a question or you post an interesting topic. You can also post just memes and generate upvotes in this way. And then there's data sets.

So you can also post an interesting data set or a data set you think might be helpful for others. And then people can upvote your data set, and by this you progress. And you basically progress by earning medals. There are bronze, silver and gold medals in each of the four areas. And then with these medals, you can reach different tiers.

So you start as a novice, I think, then you're a contributor, an expert, then a master, and the very last stage is a Grandmaster. To put that into perspective: of the 10 million users that are registered on Kaggle, there are 280 competition Grandmasters. So it's really like the elite of the elite, the top-notch people in the area.

So I have to ask, because we were talking about it: which of the three categories are you a Grandmaster in? And what's the fourth one that you're not? And of course, I'm going to ask you when you're going to become a Grandmaster in the fourth one.

I'm a Grandmaster in competitions.

And that's the most difficult one.

Indeed. Then I'm a Grandmaster in notebooks, because I shared some high-value notebooks. And then I'm also a Grandmaster in discussions, because I like to discuss stuff. That's also why I'm here. But I'm not so fond of curating data sets and uploading data sets.

I can't blame you.

That's why I'm only a beginner there.

The data sets, that would be the one I would choose first.

See, that's Daniel. Daniel loves to do data munging and stuff. It's sick. That's terrible. So I understand. I give you a pass on not being a Grandmaster in the fourth one there.

What got you into Kaggle in the first place, and what was the journey like towards where you're at now? Some people might just be jumping in on Kaggle and trying things, and they have a vision of how far this could go. But what was the journey actually like for you?

I think it's quite interesting, because my journey began right in the last months of my PhD.

So I did a PhD in mathematics, and in the last few months, after I had sent out everything and was just waiting for my defense, there was suddenly some free time and also free weekends, which I wasn't used to during the PhD. And I was also always curious about the AI topic.

So back then, it was like five, six years ago, it was not as hyped as now, but I was like, I need to learn what neural networks are and so on. And so I was just curious about that. And then I watched some YouTube videos and started a Coursera course on what neural networks are and so on.

And due to that, I quite quickly found out about Kaggle and then just started with my first competition right away. And since then I'm hooked on the system.

And how long has that been?

Six years now, I think.

And during those six years, my professional life also progressed more and more towards machine learning and deep learning and data science. So six years ago, when I joined Kaggle, I was working as a risk analytics consultant. So I had nothing to do with machine learning. I had nothing to do with data science.

I programmed a bit on risk models. I had some background in R programming or MATLAB, but I had never used Python before. And then, bit by bit, my professional career shifted towards machine learning and deep learning, until right now I'm working as a deep learning data scientist at NVIDIA, which is one of the top-notch companies in this area.

Yeah, that's like the gold standard of jobs in the AI world right there.

So the experiences on Kaggle and your success there: in what ways did that contribute to your own career advancement, and also to your understanding of what you wanted to do as your career advanced?

Yeah, it really had a lot of impact. So step by step, I moved into the position I'm in right now.

So when I started, I was doing Kaggle before and after work a bit, not too much, like half an hour after work, half an hour before work, and weekends. And of course I did horribly in my first competitions, because I had no clue of anything. But the nice thing is that you really progress step by step. So in the first competition, you do horribly, the next one you do badly, but not horribly.

And then you progress more and more and become better and better. I quite quickly realized that I had a lot more fun with machine learning and deep learning than with risk consulting, just because you can be more creative, I would say. I moved within the consultancy company. I was lucky that they also had a data science team.

So I moved to the data science team there, and I had my first synergy effects between Kaggle competitions and what I learned there and what I was using in projects. So I could use my Kaggle skills in the projects, and I could also use skills I gained in the projects in Kaggle competitions. But that was five, six years ago. There wasn't much deep learning in the industry, especially in the insurance industry, which was the focus of my consultancy company.

So I was not challenged enough, and I wanted to do more and more in this area, also as my skill set grew more and more. So I decided to quit this job and found my own deep learning consultancy, just to have even more synergy between projects and Kaggle.

Tell us a little bit about what that was like in those days, because as we've grown up with deep learning over the last few years, I would guess that at least in the beginning, it was a little bit challenging to land engagements, maybe. Or did you have them from the start?

Because I know for me, early in that phase, about the time I started the podcast, people were like, deep what? So did you have any challenges in those early days that have obviously evaporated as the world has taken this on?

Certainly, and not only in terms of projects. So people, especially the decision makers, I would say, were really cautious about what you can do with deep learning, especially five, six years ago.

There weren't any resources around. So I talked with customers about what amazing things you can do with deep learning, and then they didn't have a single GPU they had access to. So that's really like two worlds clashing against each other.

So there were a lot of interesting and challenging problems around that. But as soon as they basically gave me a chance and I could do some prototype and really show what you can do, then it was easy to convince them. But to get to this point, especially as a young consultancy startup, that was quite difficult.

So I definitely want to get into many things later on, but I'm also thinking about the people out there that are maybe inspired by your journey and wanting to get involved in Kaggle and other things.

I wonder if you can share a little bit about this: while the perceptions around deep learning that you and Chris were talking about have shifted, the tooling around deep learning has shifted during that time too. Thinking about training a deep learning model for a Kaggle competition four years ago versus being able to do that now, how have you seen that shift over that time period, in terms of the ability for people to, I guess people say democratize or whatever, the ability for people to hop in and do something advanced like that very quickly?

There are two aspects, I would say. One is software-wise and framework-wise.

There has been a lot of progress there. So when I started, it was still TensorFlow 0.something, which was working, but it's really like low-level programming. So there was nothing like an RNN or a transformer layer; you needed to code everything from scratch. But that also helps a lot for understanding things. So I think nowadays people don't really understand the granular aspects of deep learning, because you just do something like model.fit and you don't have any clue what's happening behind the curtain.
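To make the "code everything from scratch" point concrete, here is a minimal sketch of a single recurrent (Elman-style) layer written without any framework. The function names, shapes, and toy weights are illustrative assumptions and do not come from the episode; this is roughly the kind of loop that a high-level call such as model.fit keeps out of sight.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One Elman RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h_new = []
    for j in range(len(b_h)):
        total = b_h[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def rnn_forward(sequence, w_xh, w_hh, b_h):
    """Run the step over a whole sequence, starting from a zero hidden state."""
    h = [0.0] * len(b_h)
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


if __name__ == "__main__":
    # Hypothetical toy weights: 2 input features, 2 hidden units.
    w_xh = [[0.5, -0.3], [0.1, 0.8]]
    w_hh = [[0.0, 0.2], [-0.4, 0.0]]
    b_h = [0.0, 0.1]
    sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for t, h in enumerate(rnn_forward(sequence, w_xh, w_hh, b_h)):
        print(t, [round(v, 4) for v in h])
```

This only covers the forward pass; the gradient computation and weight updates that a framework also handles are left out to keep the sketch short.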

So certainly it's easier nowadays to train a model with these higher-level frameworks. Just to name a few, there's stuff like Keras and PyTorch Lightning; there are a lot of different frameworks you can use, which are really high level and accessible for beginners. And there's also a lot of training material for these frameworks. So a lot of tutorials.

So it's really easy to train a simple model for a simple task. But also in terms of resources, I think things are more beginner friendly, because on Kaggle, for example, five years ago, they didn't give you any resources. There was no Google Colab. So you basically had to have your own GPU at home. You needed to build your own desktop machine or spend your own money on cloud resources.

But now, as a beginner, you can get access to Colab, which gives you a free notebook to experiment with, you get some free resources on Kaggle, and there are a lot of student credits and student programs. So it's really easy to start your data science journey, I would say. And there's also a lot more material online. You can really teach yourself.

Hello friends. This is Jared here to tell you about Changelog++. Over the years, many of our most diehard listeners have asked us for ways that they can support our work here at Changelog. We didn't have an answer for them for a long time, but finally, we created Changelog++, a membership you can join to directly support our work.

As a thank you, we save you some time with an ad-free feed, sprinkle in bonuses like extended episodes, and give you first access to the new stuff we dream up. Learn all about it at changelog.com slash plus plus. You'll also find the link in your chapter data and show notes. Once again, that's changelog.com slash plus plus.

Check it out. We'd love to have you with us.

So Christof, you were leading in by talking about your entry into the world of deep learning and your career shift to accommodate that, and you're talking about learning from Kaggle competitions and engaging in that. And then it was increasingly applicable in your professional life.

Can you talk a little bit about how that happens? Like when you're thinking about a Kaggle competition and you're now working in a job in this field, how did the two relate? How are Kaggle competitions relevant to solving real business problems in a real job and getting that synergy? What is that like?

What is the connection between the two like?

I would say there are a lot of synergy aspects. So doing a Kaggle competition is really very similar to doing a project at work that is about building a first prototype. So in a Kaggle competition, you get a problem which you're not familiar with, often from a different domain. It can be from biology, can be from astrophysics, can be from chemistry, can be Bengali language, sign language, just so many different problems that you have no clue about when you start.

And then you have like three months' time to find the best possible solution and also compete with other data scientists. So this prototype-project character is very similar. So you have this three-month time window, and then you have a collaborative part. So in Kaggle, you can also form teams.

So you can participate in competitions in a team, which is very similar to working in a team in your job, with all the ups and downs, I would say, often working in a team under pressure. So Kaggle competitions can create quite some pressure, more pressure than you might feel in your day-to-day job. So you also get used to working efficiently with others, in terms of coding, in terms of reading their code, in terms of structuring the project. So really all aspects of project management are also important.

And also things like optimizing runtime and optimizing code structure. You wouldn't think that it's important, but I think it's quite important also for Kaggle competitions, because recently they run the competitions on restricted hardware. So you just submit your code, and they will run your code on their infrastructure using the Kaggle notebooks.

So you need to have your code in a way that it's kind of production style. That's also what you would do in a project. So you would develop ideas and so on and so forth. But at the end, you want to productize your code and you need to think about all these MLOps problems as well.

And you also train those skills in Kaggle competitions. So there really is a parallel between the two worlds. That said, I must say that two things are really different between Kaggle and a real-world project.

The first thing is data acquisition. It's a very big topic in the real world. It's no topic, well, not no topic, but a very minor topic in Kaggle competitions. You already have your training data.

Of course, you can sometimes expand your training data by looking for more data online. But in general, you already have a fixed training set you can work with, whereas in the outside world, in the real world, just acquiring some data could be the main problem. And the second thing is the definition of the metric.

So in Kaggle, people are evaluated based on some metric, and this metric is predefined before the competition starts. Whereas in the real world, that can be a discussion which takes ages between the data scientists and the business, and just creating a metric that is representative of the business problem can take a lot of time. And you don't have this issue with discussions on Kaggle.

I'm curious, as you were describing that, I had an idea that came to mind. So recognizing the limitation that you already have data provided, and recognizing the fact that the metric is well defined on a Kaggle team, both of those are kind of optimal situations compared to the business world.

But from the perspective of an organization out in the world, any organization that is keenly interested in data science and stuff, would forming Kaggle teams or participating in Kaggle teams be a good recruitment tool? Because if you can find people that are performing well on teams in that capacity, it doesn't check every box, you know, for what the business world is doing, but it kind of gives you a sense that this might be someone who could fit in with us. We're going to throw the messiness of data sets and the messiness of metrics on top of that. But what do you think of that idea?

Is that something that people might be thinking about in terms of trying to build data science teams for their organizations?

Certainly, I think that would be a great idea. People do this, and some companies already use Kaggle as a hiring tool. So in order to run a competition, those competitions are sponsored by someone.

And there are sometimes companies who sponsor a competition but also tell the participants that they are hiring, and that if you finish in a top spot, you can apply for a position there. So getting a position is kind of part of the winning prize sometimes. So they already see that Kaggle is very good for finding good candidates. But as you said, you could also do it yourself. Kaggle nowadays even offers the concept of a community competition, where you host the competition by yourself without any Kaggle involvement, and you could do this as a kind of assessment center for filtering potential hires, or to see how they interact on the problem or how they work together.

Normally competitions run like three months or so, but there are some other formats, for example Kaggle Days, which is a conference type of thing. They host these conference-specific competitions, and they just go one afternoon: people get a simple data set and have one afternoon to get a good solution. And I could definitely see how this would benefit an assessment center, for example, because you really see the whole range of skills people can bring to a company.

I have to ask, of the competitions and the notebooks that you've contributed to Kaggle, maybe the discussions too, what are some highlights for you? Of all the things that you've done, what are the things maybe you're most proud of or that you would like to highlight?

What I'm most proud of certainly are the Google Landmark competitions. So that's a competition which was hosted by Google three times, yearly. And this is about classifying popular landmarks. So you have a data set of five million images.

So it's really large scale. And in these five million images, you have 80,000 classes, so 80,000 different landmarks, and you need to classify between those landmarks. And the difficulty, especially, is that for some landmarks you only have one or two images, which makes it quite complex to classify. And another complexity is that some landmarks look quite different from different angles.

You can think of a museum, for example. People take a picture outside of the museum, people take pictures within the museum, and you would still classify it as the same landmark. So the competition is quite tricky, and I was able to win it three times, and two times of that without a team, so just solo. And that's something that's even harder in doing Kaggle competitions. So not participating within a team, but soloing, brings a lot of additional, let's say, mental stress, because you don't have a team you can talk to about your problem. You're just isolated, working on a problem for three months with high pressure and so on and so forth.

So that brings another level of mental component to the game. So I was quite proud that I could win three competitions, and two of those without any team.

So I'd like to follow up on that. You're talking to people out there that might be either already participating in Kaggle, you know, not at the level that you're at, or thinking about jumping in.

What are some of the attributes that you, and I want you to take a moment and harp a little bit on yourself, I'm asking you to, what are you bringing to the competitions that you think has really given you an edge in getting to that Grandmaster level and being so competitive at that level? Do you have anything that you can offer people that are maybe a little bit intimidated by it or trying to think, how can I level up a little bit? What would you say?

I mean, I definitely have some analytical thinking just from my study of mathematics, because the whole study is there to basically learn how to think efficiently, how to solve problems efficiently.

So that definitely helps. And coming from the natural sciences in the broader sense, a sense of solid experimentation is also very important. So really having a clean workbench, so to say: logging your experiments, following up on ideas and so on. So really thinking like a researcher in the natural sciences and running your experiments in a clean and reproducible way. That's also quite important.
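A minimal sketch of the "clean workbench" idea: fix the random seed, store the configuration next to the result, and append every run to a log file so it can be found and repeated later. The file name, the config fields, and the run_experiment stand-in are hypothetical; the episode does not say which tools Christof uses for this.

```python
import hashlib
import json
import random
import tempfile
import time
from pathlib import Path


def config_id(config):
    """Stable short id for a config dict, so identical setups share an id."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_experiment(config):
    """Stand-in for real training: returns a pseudo-random 'score'."""
    rng = random.Random(config["seed"])  # seeded, so the run is repeatable
    return round(rng.uniform(0.0, 1.0) * config["learning_rate"], 6)


def log_run(log_path, config, score):
    """Append one JSON line per run; earlier entries are never rewritten."""
    record = {
        "id": config_id(config),
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "config": config,
        "score": score,
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


if __name__ == "__main__":
    config = {"model": "baseline", "learning_rate": 0.01, "seed": 42}
    log_file = Path(tempfile.gettempdir()) / "experiments.jsonl"
    record = log_run(log_file, config, run_experiment(config))
    print(record["id"], record["score"], "->", log_file)
```

An append-only JSON-lines file is chosen here only because it needs no extra tooling; a dedicated experiment tracker would serve the same purpose.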

That's also quite important. But I think what really pushed me to the top level is the curiosity of different domains. So even like top people, they tend to, let's say, in quotation mark, lean back and do what they're good at and not expand and learn further. But I would say one more edge I get is that I really try a lot of different ideas and at the end in different areas, I try to explore like very different competitions, very different domains.

And at the end, every now and then I can leverage from something that you would think has nothing to do with the other, but you still can leverage some ideas and apply some concepts. So for example, you can transfer knowledge from audio classification to biology or to astrophysics or from NLP to computer vision and vice versa. So there's a lot of synergy people wouldn't think about and therefore it's quite helpful to explore as different domains as possible. You alluded to this a little bit in what you were saying about use to with Kaggle competitions, maybe you had to build your own machine with a GPU in it to sort of operate in that now there's good resources with GPUs.

But I'm wondering, from your perspective, both as a competitor and a Grandmaster, but also as a really senior data scientist at NVIDIA, how do you view GPU acceleration as important and playing a role in Kaggle competitions? Probably most people think about it in terms of training a model. But how do you think about that more holistically, in terms of the accelerated process that's key to performing well in competitions?

So certainly GPU-based programming, or calculation, is the bread and butter of training any model nowadays.

But also, especially at NVIDIA, they're looking more and more into moving other parts of your data science pipeline onto the GPU, just to make it faster. And especially for Kaggle competitions, the speed at which you can run experiments and try ideas is very important. So when a lot of people on the top level compete against each other, one of the edges you get is when you can do more experiments than the others, which of course are bounded by your ideas. But most of the time I'm not running out of ideas but running out of time in the competition.

And so as long as I can run more experiments than other people can, because I have a more efficient pipeline or I can run more parts of my pipeline efficiently using GPUs, that gives me an edge. And some examples of this are like data pre-processing. Oh, let's start even one step earlier. The first step is just data loading. Just loading your data frame, before doing anything, can be GPU accelerated, and it's just like 100x faster.

So every time you're working on the problem, you get a 100x speed-up just in the step of loading your data. And that's what RAPIDS, for example, is all about. So RAPIDS is an NVIDIA tool stack which is all about accelerating those parts which are not training the model, but are what is normally done with pandas, for example. So they have a part which is called cuDF, which is basically pandas on GPU.
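The "pandas on GPU" description can be sketched as the same group-by code running on either library. This is a minimal illustration, assuming cuDF's pandas-like DataFrame, groupby, and mean calls; the fallback to pandas, the helper names, and the user_id and score columns are assumptions added here so the sketch also runs on a machine without a GPU.

```python
import importlib


def pick_dataframe_backend():
    """Return cudf if it imports cleanly, else pandas, else None."""
    for name in ("cudf", "pandas"):
        try:
            return importlib.import_module(name)
        except Exception:  # cudf can fail to import without a usable GPU
            continue
    return None


def mean_score_per_user(xdf, columns):
    """Same group-by code for both backends; returns a plain dict."""
    frame = xdf.DataFrame(columns)
    means = frame.groupby("user_id")["score"].mean()
    if hasattr(means, "to_pandas"):  # cudf result: copy back to host memory
        means = means.to_pandas()
    return means.to_dict()


if __name__ == "__main__":
    backend = pick_dataframe_backend()
    if backend is None:
        print("neither cudf nor pandas is installed")
    else:
        columns = {"user_id": [1, 1, 2, 2, 2],
                   "score": [0.5, 1.5, 1.0, 2.0, 3.0]}
        print(backend.__name__, mean_score_per_user(backend, columns))
```

On a toy frame like this the GPU brings no benefit; the speed-ups discussed in the episode concern large data frames, and the actual gain depends on data size and hardware.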

They have something which is called cuML, which is basically scikit-learn on GPU. So things like clustering, all this stuff you can do on GPU nowadays. Another example is NVIDIA DALI. That's a tool especially for image processing, but it also supports audio and video.

An example there would be the decoding of JPEGs. So people wouldn't think about that, but something like having a JPEG on your disk and just loading the JPEG involves a decoding step, which basically decodes the JPEG format. And this can already be done using GPUs and can be accelerated by GPUs, and it also gives you a significant speed-up during your training, during anything which uses the images. So there are a lot of different steps in your pipeline that you can accelerate.

And that's what accelerated data science is all about. So NVIDIA tries to move the complete pipeline, from loading the data to saving, like, conclusions and results, all end to end onto GPUs.

Yeah, that's really interesting. And I'm guessing that some of the things that you're talking about, like loading images or loading data frames or manipulating data frames, maybe doing certain operations, doing clustering, I don't know that this is the case, but I would guess those things pretty consistently show up across competitions too. Or in the real world, you could think about them as showing up across many different business problems.

So you were talking about your pipeline of processing, which I think is really interesting. I'm wondering if you can dig into that a little bit, not a specific pipeline, but how you think about solving a problem, because most people might come to a Kaggle competition or a real-world problem and say, okay, here's my data. My main step is this training of the model and maybe evaluation. Like how good is my model, retrain it, how good is it, retrain it. How do you think about the data pipeline around, you know, like you're talking about, running experiments?

What does that data pipeline look like in your mind? And what are some of those reusable components or things you find yourself doing over and over again, where you've found accelerated ways to do those things using GPU tooling like RAPIDS or DALI?

It really depends from project to project, I would say, whether it's applicable or not. So I would say that RAPIDS, for example, is even more applicable to the real world, because there you might have very large data frames, for example.

So if you're a bigger company, you have user data or you have client data, whatever, whereas the Kaggle competitions often are packed into little problems that people can work on, and not these company-size, large-scale data sets with millions of users or thousands of users. And things like RAPIDS especially shine on these large-scale data sets. As for me, my pipeline is, I would say, modular, and it developed through the years, coming from the competitions. So of course, I try to reuse as much as possible, just to be efficient.

So I have a really modular setup where I have one part which is just the model training, one part which is about the storage of my data, one part which is about logging the experiments and tracking and visualizing results, and one part which is about the framework setup, so to say. So I use Docker with a specific PyTorch image to always have the same environment, so I can replicate my experiments and also use the exact same environment on different machines, in the cloud or locally. Those are all things I learned during the years. So it's a little bit complicated to explain the whole pipeline now on the podcast.
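The modular setup described here can be sketched as separate components that only meet in one small function. Everything below (the component names, the Config fields, the toy "training" step) is a hypothetical illustration of the separation of concerns, not Christof's actual pipeline; the Docker and PyTorch parts are outside the scope of this sketch.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Config:
    """Single place for settings shared by all components."""
    name: str
    threshold: float = 0.5


@dataclass
class MemoryLogger:
    """Tracking component; a real one might write to disk or a dashboard."""
    records: List[Dict] = field(default_factory=list)

    def log(self, **entry):
        self.records.append(entry)


def load_data(config: Config) -> List[float]:
    """Storage component; here just a fixed toy data set."""
    return [0.2, 0.4, 0.6, 0.8]


def train(config: Config, data: List[float]) -> Dict[str, float]:
    """Training component; the 'model' is only the share of values above a threshold."""
    above = [1.0 if value > config.threshold else 0.0 for value in data]
    return {"share_above_threshold": mean(above)}


def run(config: Config,
        loader: Callable[[Config], List[float]],
        trainer: Callable[[Config, List[float]], Dict[str, float]],
        logger: MemoryLogger) -> Dict[str, float]:
    """Wire the components together; each one can be swapped independently."""
    metrics = trainer(config, loader(config))
    logger.log(experiment=config.name, **metrics)
    return metrics


if __name__ == "__main__":
    logger = MemoryLogger()
    print(run(Config(name="demo"), load_data, train, logger))
    print(logger.records)
```

Because run only sees callables and a logger, a different data source or a real training routine can replace the toy versions without touching the other parts, which is the reuse benefit the passage describes.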

I actually gave a one-hour presentation two weeks ago just about this topic, so it's pretty difficult to boil down to a few sentences.

It's hard without a diagram, for sure, but it's super interesting to me. The things you're talking about that you've made modular are, I think, things that people operating in a real-world data science environment eventually need to make into components that work within their team, right?

My team, for example, loves using Streamlit to do some data manipulation, visualization, and interactive stuff on the other end. We have a lot of those, and we reuse a lot of those components. And we have certain multilingual models that we train over and over.

So we've got modules around that, and then pre-processing and other things. It's interesting how much what you're talking about overlaps with, I think, the efficiencies you gain over time as a data science team operates together and learns how to make its own processes more efficient. So I think that's really interesting. I have played around with RAPIDS a few times, and it is really cool.

And I'm just looking at the latest stats here on the RAPIDS website. It's talking about performance on a data frame of 300 million rows by two columns, with the highest speedup being for group-by operations, around 80 times faster than without RAPIDS. I don't know how much time that saves you, but as you were saying, if you're doing experiments over and over and you want to do them rapidly, even if it saves you something smallish, a couple of minutes, you're able to do things much faster, and your automation goes faster.
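For readers who have not seen one, a group-by of the kind mentioned here looks roughly like the sketch below. cuDF is the RAPIDS dataframe library and follows the pandas API closely, so the same lines can run on either. The function names are hypothetical, the fallback chain (cuDF, then pandas, then plain Python) is only there so the sketch runs on machines without a GPU, and no timing claims are made.

```python
from collections import defaultdict


def load_dataframe_library():
    """Return (module, name), preferring cuDF and falling back to pandas."""
    try:
        import cudf  # RAPIDS GPU dataframe library; needs a supported GPU
        return cudf, "cudf"
    except Exception:  # not installed, or no usable GPU on this machine
        pass
    try:
        import pandas
        return pandas, "pandas"
    except ImportError:
        return None, "python"


def groupby_sum(keys, values):
    """Sum `values` per key and return (totals_dict, backend_name)."""
    lib, name = load_dataframe_library()
    if lib is None:
        # Plain-Python reference implementation of the same operation.
        totals = defaultdict(float)
        for k, v in zip(keys, values):
            totals[k] += v
        return dict(totals), name
    df = lib.DataFrame({"key": keys, "value": values})
    summed = df.groupby("key")["value"].sum()
    if name == "cudf":
        summed = summed.to_pandas()  # copy the small result back to the host
    return {k: float(v) for k, v in summed.to_dict().items()}, name


totals, backend = groupby_sum(["a", "b", "a"], [1.0, 2.0, 3.0])
print(backend, totals)
```

On a toy input like this the backends are indistinguishable; the speedups quoted in the conversation refer to data frames with hundreds of millions of rows.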

You can learn things much faster and reduce that cycle time. Although I'm also assuming that for many people and their data, it might be more than a minutes-long speedup on some of those operations. So when you're helping people, and you mentioned the discussion groups and the notebooks that you've worked on on Kaggle, is this something where you've seen light bulbs come on for people, when they're saying, I'm trying this group-by operation or something on this data, and it's taking me 15 minutes every time I run through this?

Is that something you've been able to bring into those discussions and notebooks on Kaggle?

Yeah, certainly. Loading data frames is a good example. 80 times doesn't sound like that much, I think, but it's like one minute versus two hours.

That's the scale you're talking about: loading your data frame in two hours or loading it in one minute. That's roughly what an 80x speedup means. And especially on Kaggle, those discussions get a lot of traction, because on your inference you actually have a time limit of around nine hours. So people try to get as much stuff into their submissions as possible.
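The arithmetic behind that comparison is easy to check. Two hours divided by 80 is a minute and a half, so "one minute" is a rounded figure. The sketch below also shows what share of a nine-hour inference limit the loading step would take before and after; the function name is hypothetical and the numbers are the ones quoted in the conversation.

```python
def accelerated_minutes(baseline_minutes: float, speedup: float) -> float:
    """Runtime after applying a speedup factor to a baseline runtime."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_minutes / speedup


baseline_minutes = 2 * 60                      # two hours of loading
fast_minutes = accelerated_minutes(baseline_minutes, 80)   # 1.5 minutes

limit_minutes = 9 * 60                         # nine-hour inference limit
share_before = baseline_minutes / limit_minutes  # about 22% of the budget
share_after = fast_minutes / limit_minutes       # well under 1% of the budget

print(f"{fast_minutes} min, {share_before:.1%} -> {share_after:.1%}")
```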

So loading data frames, manipulating data frames, loading images, all that stuff: if you can speed it up, people will very, very gratefully adopt whatever you give them to speed up their stuff. And that's only the inference side. It's even more true for training because, as you said, my day-to-day is doing a lot of experiments, and those speedups accumulate.
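Since these speedups accumulate over many runs, it helps to know which stage is actually the slow one before accelerating anything. A minimal sketch of per-stage timing follows, using only the standard library; the `StageTimer` class and the stage names are hypothetical, and the two stages are trivial stand-ins for real loading and pre-processing.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = {}   # stage name -> total seconds
        self.counts = {}   # stage name -> number of calls

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed
            self.counts[name] = self.counts.get(name, 0) + 1

    def report(self):
        """Return (stage, total_seconds, calls) tuples, slowest stage first."""
        rows = [(n, t, self.counts[n]) for n, t in self.totals.items()]
        return sorted(rows, key=lambda r: r[1], reverse=True)


timer = StageTimer()
for _ in range(3):  # three simulated experiment runs
    with timer.stage("load"):
        rows = [float(i) for i in range(10_000)]
    with timer.stage("preprocess"):
        rows = [r * 0.5 for r in rows]

for name, seconds, calls in timer.report():
    print(f"{name}: {seconds:.4f}s over {calls} calls")
```

The stage at the top of the report is the first candidate for a faster loader or a GPU-accelerated replacement.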

So the very first thing I do in a competition, for the first two weeks or so, is just optimize my workflow. I optimize all the runtime, optimize how I load my things, and accelerate all the pre-processing, post-processing, whatever I have in my pipeline. Then I can leverage the remaining time from the most perfect setup or the most perfect code, because then I can just run more and more experiments.

So I'm curious. As you've been talking about optimizing and being able to do all of these iterations on your experiments, there are people out there, including myself, who are either wanting to jump into a Kaggle competition, psyched up because they've been listening to how you've mastered this process, or working for a company and trying to get their own systems better and better, and early teams really struggle with that.

So either way, with you talking about what you've done and Daniel jumping in and talking about the way they work, there are people who want to be there with you, or at least want to get on that path. Do you have some concrete recommendations for somebody who's at the beginning of that and thinking, okay, I'm doing data science, but my God, it's taking me a long time to get through each iteration, and I'm listening to this grandmaster just cranking out productivity so fast? What are a couple of specific things you would point them to, recognizing that they'll find their own path forward and make their own adjustments? How do they get on that path to begin with?

The first thing, and I've said this several times, is just to start your very first Kaggle competition. So you go to Kaggle.com, you look through the ongoing competitions, which is like 15 to 20 ongoing competitions, and just choose any topic you find interesting.

You don't need to be an expert in the topic, you don't even need to know about the domain, but just starting is the first step. And as soon as you start, just through the sheer amount of knowledge shared in the forums and the notebooks, you will see that you learn very, very efficiently how to improve your code and your skill set. And you get immediate feedback, on the leaderboard, for example, or in the discussions: if you write a comment and it doesn't make sense, people will tell you, and if it does make sense, people will also tell you, and thank you. So you get feedback in a very objective way, seeing your performance and seeing your progression. So that's the very first advice I would give someone: try to find an interesting competition and just start.

There's basically nothing to lose; you can only gain knowledge. As I said, you will perform poorly in your first competition no matter what background you come from, but just starting is the first step. And as you start, I think the best advice is to start simple, as simple as possible, and just try to progress from there. You start with a very simple model on a subset of the data, or with images that are downsampled to a low resolution, just to find an efficient pipeline and to work on your code, because all of this is an investment for the future, and it gives you an easier setup to work on and to improve on.
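Both "start simple" tricks mentioned here, a subset of the data and low-resolution images, fit in a few lines. The sketch below is an illustration under simple assumptions: the function names are hypothetical, the image is a plain list of pixel rows, and downsampling is done by keeping every n-th pixel rather than by proper interpolation.

```python
import random


def sample_subset(items, fraction, seed=0):
    """Return a reproducible random subset containing `fraction` of the items."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    items = list(items)
    if not items:
        return []
    k = max(1, int(len(items) * fraction))
    # A fixed seed keeps the subset identical from one experiment to the next.
    return random.Random(seed).sample(items, k)


def downsample(image, factor):
    """Keep every `factor`-th pixel in each direction (simple striding)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [row[::factor] for row in image[::factor]]


# A 4x4 "image" whose pixel values encode their position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
small = downsample(image, 2)            # [[0, 2], [8, 10]]

train_ids = sample_subset(range(100), 0.1, seed=0)   # 10 of 100 sample ids
print(small, len(train_ids))
```

Once the pipeline runs end to end on the small version, the fraction and the resolution can be raised step by step.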

Yeah, really good advice. I think that part you talked about, spending a couple of weeks optimizing the inputs, outputs, and those portions of your pipeline so that you can really put a lot of your focus on fast iterations on the model, or that middle bit, is really, really good advice. This has been a really fascinating conversation.

I have a long way to go to be a grandmaster, that's for sure. But as we wrap up this discussion about accelerated data science and Kaggle competitions, what are you excited about looking to the future? You mentioned that you're curious about all sorts of different domains. You've worked on a lot of different problems.

What really excites you right now as you look towards the future, in terms of things that you want to try, or just in general, in terms of the tooling or the community around what you're involved with?

I would say in the short term, I'm definitely excited about, or interested in, how AI will support my work. So something like GitHub Copilot or other large language models that help me code. I haven't tried them that much, but I think that in the near future, those tools will support our everyday life in some way.

But I'm even more excited about the long-term prospects, like what will happen in 10 or 20 years. That's really exciting, because if you think back 10 or 20 years in terms of AI and what systems could do, look at where we are right now, and extrapolate that into the future, what will happen then will be very exciting and amazing.

Yeah.

I think that's a great way to wrap things up. Thank you so much for joining us, Christof. I'm really looking forward to following your progression and the things that you work on in the future, and the great things that continue to come out of NVIDIA. So thank you for your work, and thank you for taking time to join us.

Thank you for having me.

Thank you for listening to Practical AI. Your next step is to subscribe now, if you haven't already. And if you're a longtime listener of the show, help us reach more people by sharing Practical AI with your friends and colleagues. Thanks once again to Fastly and Fly for partnering with us to bring you all Changelog podcasts.

Check out what they're up to at fastly.com and fly.io. And thanks to our beat freak in residence, Breakmaster Cylinder, for continuously cranking out the best beats in the biz. That's all for now. We'll talk to you next time.

Frequently Asked Questions

How long is this episode of Changelog Master Feed?

This episode is 43 minutes long.

When was this Changelog Master Feed episode published?

This episode was published on April 4, 2023.
