COVID-19 Q&A and CORD-19 - Changelog Master Feed

What this episode covers

So many AI developers are coming up with creative, useful COVID-19 applications during this time of crisis. Among those are Timo from Deepset-AI and Tony from Intel. They are working on a question answering system for pandemic-related questions called COVID-QA. In this episode, they describe the system, related annotation of the CORD-19 data set, and ways that you can contribute!

Sponsors:
DigitalOcean: DigitalOcean’s developer cloud makes it simple to launch in the cloud and scale up as you grow. They have an intuitive control panel, predictable pricing, team accounts, worldwide availability with a 99.99% uptime SLA, and 24/7/365 world-class support to back that up. Get your $100 credit at do.co/changelog.
AI Classroom: An immersive, 3 day virtual training in AI with Practical AI co-host Daniel Whitenack.
Fastly: Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com.

Featuring:
Timo Möeller
Tony Reina
Chris Benson
Daniel Whitenack

Show Notes:
COVID-QA
CORD-19
Episode featuring SpaCy
Sentence transformers
Episode on BERT
German BERT from Deepset-AI
Hugging Face Transformers
Haystack from Deepset-AI
Deepmind molecular structure with RL project


TRANSCRIPT · AUTO-GENERATED

Bandwidth for Changelog is provided by Fastly. Learn more at fastly.com. We move fast and fix things here at Changelog because of Rollbar. Check them out at rollbar.com. And we're hosted on Linode cloud servers. Head to linode.com/changelog. This episode is brought to you by DigitalOcean.

DigitalOcean's developer cloud makes it simple to launch in the cloud and scale up as you grow. They have an intuitive control panel, predictable pricing, team accounts, worldwide availability with a 99.99% uptime SLA, and 24/7/365 world-class support to back that up. DigitalOcean makes it easy to deploy, scale, store, secure, and monitor your cloud environments. Head to do.co/changelog and start with a $100 credit.

Again, do.co/changelog. Welcome to Practical AI, a weekly podcast that makes artificial intelligence practical, productive, and accessible to everyone. This is where conversations around AI, machine learning, and data science happen. Join the community and Slack with us around various topics of the show at changelog.com/community, and follow us on Twitter. We're @PracticalAIFM.

Okay, here's Daniel and Chris. Welcome to another episode of Practical AI. This is Daniel Whitenack, one of your co-hosts. I'm a data scientist with SIL International and I've got my co-host with me, Chris Benson, who is a principal AI strategist at Lockheed Martin.

How you doing, Chris? I am doing okay. I'm safe and I'm well and so is my family, so that is a good sign for me. Yeah, that's great.

I'm recording in a new location because we've got a bit of a full house at the moment with my brothers-in-law being back from college and living with us, and so I've transitioned my studio out into the dining room. So it's been an interesting week in that sense as well. We're all making adjustments these days. The world is in an unusual time.

Yeah, yeah, we're all making adjustments. Yeah, I've actually, it seems like I've been more busy work-wise since the crisis started than even before because SIL has been making various efforts to contribute in beneficial ways related to COVID-19, including trying to translate the phrase wash your hands into as many languages as we can. Part of that is machine-generated and then part of that is just crowdsourced translations and I think we're up to 454 in my last check. I know, I saw you tweet about that a few days ago and so if you're not following Daniel, you definitely should and you can see the work that he did there.

Yeah, and those conversations and that work also led to some other discussions with one of my contacts at Intel and she pointed me to this other project called COVID-QA, which some people at Intel started collaborating with this team from Deepset AI. And I was super fascinated by this project and also interested in potentially contributing because they're looking to up the language support as well. But I had a conversation with them and they've agreed to be on the podcast today. So we've got Timo from Deepset and then also Tony from Intel.

Welcome guys. Hey, welcome. Thanks so much. Thanks for the introduction.

It's really been a great week, or two weeks actually, because this is when we started this COVID-QA project. But should I talk a little bit about myself first? Yeah, please do if you want to introduce yourself and then we'll ask Tony to do the same. OK, yeah.

So I'm Timo and I'm a co-founder of an NLP startup in Berlin. And I would say a total NLP, natural language processing, geek. I studied data science and computational neuroscience and then co-founded the startup Deepset two years ago in Berlin, which is actually a really great place for a startup. A lot of talent is coming here and also a lot of the open source companies are based in Berlin, for example spaCy or, maybe you know, Rasa.

Yeah, we had spaCy on the podcast a little bit ago. And yeah, Berlin sounds like quite a place to be a developer. I definitely need to make a trip there. I was going to say we need a Practical AI road trip.

Exactly. Yes, totally. Yeah, it's a totally great, vibrant city.

Of course nowadays it's a bit more empty and calm. For example, we have a huge airport that got shut down after the change in government. And this field is completely empty nowadays. And normally it's full of people, a lot of people doing sports or celebrating there.

But yeah. And so at Deepset I'm responsible for innovation because we believe that there have been a lot of advancements in deep learning and also natural language processing. And this has to be brought to the industry. But also what is really important to us is getting NLP technology to work on the German language.

And for this we are very deeply rooted in open source technology. We trained a BERT model, like these language models that got open sourced by Google. And we trained this on a lot of German text data and also open sourced this. And this is giving us a lot of traction from the community; a lot of researchers are using this.

And yeah, this is just a really great time to contribute to open source projects. Awesome. Yeah. Thanks for the intro. Tony,

you want to let us know a little bit about your background and how you eventually ended up where you're at now? Yeah, absolutely. Thanks for having us. So I'm Tony Reina.

I'm a medical doctor and data scientist. And I'm a chief AI architect for Health and Life Sciences at Intel. So my primary role is actually taking artificial intelligence algorithms and trying to make them run faster on, well, Intel products, obviously. A lot of what I've been doing has been in the medical imaging space.

So CT, MRI, things like that. But I've been branching out into genomics and particularly natural language processing, so the NLP stuff. Timo and I first met through some of the German work that he was doing, and I like the playfulness of NLP. They name things after Sesame Street characters, like BERT and ERNIE and ELMo and things like that, which makes it kind of a fun group to be working with, when you work with researchers that really love to have fun things to do.

It makes the logos a lot better. It makes the logos a lot better. Nobody forgets Bert now. Poor Bert.

Poor Bert. I mean, BERT's a world celebrity now among AI researchers. But yeah, no, it's really great. I mean, BERT has only been in existence a couple of years now, maybe even less, and has just kind of taken the field by storm.

So yeah, this project with Timo that we kind of looked at, since we were already connected, he just popped up on my LinkedIn page and said, hey, we're doing this COVID question answer thing. We'd love to get some help. And I said, well, let's figure out how to help out. So that's what we've been doing is obviously, we've been really busy in Intel just working as everybody is around the planet basically trying to figure out ways we can help with COVID.

And obviously, we're a tech company. So we're not health care. We're not going to be able to go out there and do magic, like health care providers are doing. But we're going to do what we can.

And this is one of the ways we think we can really make a difference. This is just such an unprecedented time. And it's moving so fast that for context, for listeners who are tuning in, we're actually recording this on Tuesday, March 31. And we don't normally say that in record episodes.

But given the topic and given how fast this is evolving, I thought that a point in time was worth having. Just to set the context, and then I'd like to come back over to you, Tony, for a little level setting for us. I know that right now we're at a point where there are 203 countries, areas, and territories that have COVID-19 cases. As of today, the World Health Organization said 754,000 and change cases; pardon me, I'm just rounding out the numbers.

There are almost 37,000 cases around the world that resulted in death. In the US, we're at 163,500 cases. And we are approaching 3,000 deaths, which we may hit today based on the current run rate here in the US, which would put us on the same level as 9/11 in terms of that.

So it's a moment in history that none of us have ever experienced. There's nothing, I guess, other than maybe the Spanish flu of 1918 that's comparable in any way. And I know that has limited comparisons. I'm wondering if you can kind of level set beyond just the numbers that I called out that are on the websites everyone is following.

Where are we today? What does that look like from your perspective as a medical doctor that's dealing with this? And then obviously, we'll talk about how we're using data to start attending to these problems that we're facing. Yeah, I think just from the medical aspect, I mean, I haven't practiced in over 20 years, basically.

So I mean, I wouldn't be the best person to answer about the clinical things. But this is the strangest part of this whole thing, is that when people ask me how is it going, all I know is because we're kind of locked down and stay in place and shelter in place. I can tell you how it's going in my house. Anything else I get from reading and from listening to the news and things like that.

And I think that's what's just kind of curious about this, is I feel like I want to be out there and doing things. And even my wife's a retired psychiatrist from the Navy. And she was actually even thinking about, should I kind of lend a hand somewhere? What can I do?

And I guess that's a great thing. But that's also kind of where we're at, is everybody wants to help. And yet it's this odd situation where the best thing for most people is just to shelter in place and make sure that we don't keep spreading things and trying to get it under control. That's a great point right there.

Yeah, definitely. And one of the things you mentioned, which really struck a chord with me, is the idea that we're all kind of sheltered in place, at least for the most part, a lot of people are. And we're getting information from various sources. There's so much information swirling around.

Some of that's recent, some of that's not recent, some of that's from trusted sources, some of it's not from trusted sources. We're hearing anecdotes from our friends and family. They're hearing things and things are getting passed second hand. Could you talk a little bit, either one of you, about what you've seen in terms of the spread of accurate information and the problems related to information spread and the virus?

Yeah, I think Timo should go on that, because he's definitely the one that started this, trying to get factual information out there. Yeah, exactly. So of course, I mean, on social media it is quite difficult to discern really truthful information.

And this is exactly how we started the COVID-QA project. Two weekends ago, there was a hackathon organized by the German government and authorities. It was actually a huge event, 45,000 people in one Slack workspace, all virtual. Yeah, and all remote and like a beehive buzzing about.

And as part of this hackathon, we decided to focus on getting factual information. And that's why we looked at official government pages. And we saw very quickly that if you look at a single government page, there's not so much information, and that the information people need is actually spread across a lot of official pages. And this was exactly the birth hour of COVID-QA, where we wanted to aggregate these official sources and make them available and searchable in a meaningful way.

And yeah, this was during the hackathon two weekends ago. And there were about 25 developers just jumping on to this project. We were five core developers from Deepset, and we worked basically the whole weekend through. And with the support of these external people, it was really fun to develop the UI, to develop the back end, to develop scrapers, and bring all the pieces together.

And afterwards, there are also people still interacting, and also external people coming and wanting to collaborate, wanting to help, wanting to extend. And this is exactly how we got in contact with Tony, through LinkedIn. And I think this is the greatest part of this project: to have a community that is fast and agile, that is not bound to bureaucracy, where there are no long approval processes.

It's just we have a situation and we need to work on it. And this could help people actually save their lives or the lives of their relatives. The nice thing about working with Timo's group is that Deepset was already set up to kind of do NLP and do it at scale. And so it was one of these things where I knew coming into it that A, they had something already in the first weekend that you could work with.

And B, they had the engineers and they had the data scientists that could make this thing scale. So they really just needed resources to kind of come in and help them to make it scale. But they had the machinery ready to go. So I'm curious, before we end, and I want to dive into that machinery here in just a second in terms of the end goals of COVID QA and its functionality and some of the things under the hood.

But before we get into that, maybe we could just kind of set a foundation in terms of, after you've looked at what sources of data are out there, what sources of information are out there, what they're talking about and what people are asking, what is the sort of information that people really need to know during this time? Is it symptoms? Is it best practices for hygiene and hand washing? What are you seeing as some of the main pieces of information that really need to get to as many people as possible?

It depends. So I think there are two groups that we're getting at with this. The first group is just the lay person that's out there, and they're the ones that are going to hit the tool as it exists right now, which is one that basically will sift through a lot of World Health Organization and CDC kind of FAQs, and they're looking for, what's the best way to disinfect my house? Or what's the best way of washing my hands? Or can I eat this?

Or will this help? That's where it is right now. So that's kind of the first group of people that would be using this. And then what I thought was interesting was coming in to add the second group, which is going to be the researchers that want to look for new things.

And these are the data scientists and geneticists and physicians and epidemiologists that want to come in and actually do research on COVID and on coronavirus. And so one of the things that Timo and I talked about was that there was a data set that was released on Kaggle by the Allen Institute, by the White House, NIH, Georgetown, CZI, MSR, it was a whole group that put it out, called CORD-19; it was the coronavirus open research data set challenge. And it's something like, I want to say, like 25,000 PubMed articles. So these are peer reviewed, high quality articles; they basically just did a search on coronavirus and virus and got all the articles, basically.

And so the idea was, well, BERT and all of these great models can do something called extractive QA, where you get a question and answer system for this large body of articles. And so the question would be, when the Kaggle thing went out, it was like, here's a bunch of data. Can you find interesting things to do with it? And I thought, well, the first thing you need with a mountain of data is a way to sift through it for relevant, actionable data.

And Timo's group had something called Haystack, which was like trying to sift through a haystack for a needle. And I thought, what if we take this, we annotate it using the Stanford question-and-answer type of models, the SQuAD models, and give researchers a free tool so that they can go through the CORD-19 data set and type in random questions, things that are not going to be how to wash your hands, but things that are like the beta subunit of the globulin, of the such and such, whatever. And it will actually give you a relevant answer and a few published articles that you can actually look to, so you can go through these 25,000 articles and get to the real meat of the issue. Yeah, I also really like this dual use of this project.

And to come back to the question, I looked at quite a lot of FAQs for the general public. The most important information is really informing people how corona spreads and how to prevent the spreading. I think, especially in larger cities where people are crowded together, this information needs to be spread in the right way and in a trustworthy way. I think this is really important.

And then there is this dual use, for the general public and then, as Tony mentioned, for the researchers; this will be incredibly useful to speed up the innovation process. Right? Right. Hi there, this is Daniel Whitenack, one of the co-hosts of Practical AI.

And when I'm not working on Practical AI, I'm developing my own AI applications or I'm training teams at other companies. I've been doing this for over 10 years now and I've trained more than 1,000 people. Now I'd like to invite you to my new, live, online training event called AI Classroom. In AI Classroom, I'm going to teach the practical skills I've learned over the years using the latest open source AI technology.

You will learn both AI theory along with practical hands-on implementations in both PyTorch and TensorFlow. After attending AI Classroom, you'll be able to understand the latest models, implement your own models in code, train computer vision and NLP models, create model inference servers, and experiment with state-of-the-art methods like reinforcement learning. AI Classroom is taking place this May. It'll be taking place live and completely online in a high quality virtual classroom, so no travel is required.

There will be two cohorts with convenient time zones for Eastern and Western hemispheres, so don't miss out. Tickets and more information is available at datadan.io. That's datadan.io. And early bird pricing lasts until April 3.

See you online in AI Classroom. So I guess coming out of that and into looking at the next layer, I'm wondering, we've talked about what COVID-QA is, and we talked about it being based on the CORD-19 data set. I'm wondering if at this point, now that everyone has a sense of what you're trying to accomplish, you could dive into specifically what this is that you're putting out there and making available to the public, and as we get a sense of that, we'll dive into how it works and what the technology underlying it is. Yeah, maybe it's best to separate it then.

There is this dual use. One is for the general public; let me explain this one first. And then there is also the researcher use, where we mainly use extractive technology. Sounds great.

Exactly. So for the general public, it is basically matching user searches, user questions, to the questions we acquired from the official FAQ pages, and the technology is based on open source technology: PyTorch, Hugging Face Transformers, and also our other framework, Haystack, that can basically do question answering at scale. And we started off with the question matching in a very simple way. So basically, we just indexed the questions in Elasticsearch, and incoming queries were then matched against this Elasticsearch index, which is basically just rule-based matching.
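To make the keyword baseline concrete, here is a minimal sketch of the idea in plain Python. It is not the project's code and it does not call Elasticsearch; it only assumes that the baseline ranks stored FAQ questions by how many (and how rare) words they share with the incoming query. The function names and the three FAQ questions are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_faq_questions(query, faq_questions):
    """Rank FAQ questions by rarity-weighted word overlap with the query."""
    docs = [set(tokenize(q)) for q in faq_questions]
    n_docs = len(docs)
    # Document frequency: in how many FAQ questions each term appears.
    df = Counter(term for doc in docs for term in doc)
    query_terms = set(tokenize(query))
    scored = []
    for question, doc in zip(faq_questions, docs):
        # Shared terms that are rare across the FAQ count more than common ones.
        score = sum(math.log(1 + n_docs / df[t]) for t in query_terms & doc)
        scored.append((score, question))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative FAQ questions, not taken from any official page.
faq = [
    "How does the virus spread?",
    "What is the best way to wash my hands?",
    "How can I disinfect my house?",
]
ranking = rank_faq_questions("how do I disinfect my house", faq)
best_score, best_question = ranking[0]
print(best_question)
```

Because nothing here depends on the grammar of a particular language, the same approach carries over to other languages as long as the text can be split into words, which matches the point made about the baseline being easy to extend.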

And we thought this is a good baseline for people to continue working and developing, because it's easily extendable to other languages. Elasticsearch doesn't really care so much about the language that has been input, and it's also super fast. And during the hackathon and the last days, we experimented a lot with BERT-based embeddings. You just use BERT as a language model where you stick in text and you then get a vector representation. Basically, like word2vec did before for words, it now works on the sentence or document level to get a document representation.

And these models, they really don't work so well when you just take the embedding and you compute similarities. Like for example, with a cosine similarity metric, they don't work so well out of the box. So you need to find ways to adjust these language models to suit your needs. And there we used a really nice library, sentence-transformers, from a German NLP laboratory, the UKP Lab.

Nils Reimers is the main contributor there. And this basically takes a BERT model and creates a clone out of it, like a Siamese network. So the weights are totally the same. You stick in the query that the user types in.

And on the other side, you stick in the questions that you have already crawled. And then you get representations for both. And then you can compute a similarity metric. And this whole network is trained end to end with exactly these user questions and the questions you have.
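The matching step described here can be sketched in a few lines of plain Python. This is only an illustration of the structure: the `encode` function below is a toy bag-of-words stand-in, not BERT and not the sentence-transformers library, and all names and example questions are assumptions. What it does show is the Siamese idea that the same encoder is applied to both the user query and the stored FAQ questions, after which a cosine similarity picks the closest match.

```python
import math
import re
from collections import Counter


def encode(text):
    """Toy stand-in for a sentence encoder: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, faq_questions):
    """Encode both sides with the SAME encoder and return (score, question)."""
    query_vec = encode(query)
    scored = [(cosine_similarity(query_vec, encode(q)), q) for q in faq_questions]
    return max(scored, key=lambda pair: pair[0])


# Illustrative FAQ questions, not taken from any official page.
faq = ["How does the virus spread?", "How should I wash my hands?"]
score, matched = best_match("how do i wash my hands", faq)
print(matched, round(score, 2))
```

In a trained system the encoder would be a neural network whose weights are adjusted so that rephrasings of the same question end up close together; the ranking logic around it stays the same.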

And this works really, really great. Like the more data you feed into this network, the better it can match questions. And we've then also seen over the course of the hackathon that this is the way to go. And we need to extend this also to other languages because the questions from official FAQ pages are phrased in a very official tone.

And people who want to ask questions write in a more colloquial manner, or there are also spelling mistakes. And these models cover this, in part, quite well. This is why we are actually trying to push in this direction more and more. So I'm pretty curious about that.

And Chris could probably guess that I'm very curious about that because of my interest in languages. Which we've talked about a lot. So you started talking about the sort of ELK stack or Elasticsearch matching with the index and then talked about BERT. So I'm curious, there's of course a lot of marginalized language communities out there that also need this sort of information and are only becoming even more marginalized because they don't have access to proper health information.

And I'm curious, so with that sort of flow that you talked about, you have a language model, let's say like BERT or a transformers model. So you could train this sort of model assuming you had data in the language. And then there's this sentence transformers and matching piece that you talked about. In terms of transitioning that piece and the training of that piece, you mentioned you have kind of your set of scraped questions and then the set of user questions.

Do you need annotation in terms of matching known user questions to the properly matched FAQ scrape data? Is that what you need for that certain piece? Exactly, this is exactly right. And we created manually, so the core team distributed a set of like 30 questions and we manually created a rephrasing of these questions to basically evaluate the models.

But now we have also implemented a feedback mechanism in the UI, and we also have a Telegram bot; we can maybe talk about the Telegram bot later. And there people can actually give positive or negative feedback, saying that maybe the content is irrelevant and doesn't match the question, or that the content is outdated, for example, to inform us that we have to adjust our scrapers in some way. And this, we hope, will scale to other languages, and all the data that is coming from this will be open sourced in the COVID-QA repository, and we will make this also available to other researchers so that they can improve question matching for COVID-related questions. This is super cool.
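As a rough illustration of how such feedback could be turned into data, here is a small Python sketch. The record layout, the label names ("relevant", "irrelevant", "outdated"), and the helper functions are all hypothetical; the episode does not describe the project's actual schema. The sketch only shows the two uses mentioned: confirmed matches become question pairs for improving the matcher, and "outdated" reports point at pages whose scraper may need attention.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """One piece of user feedback on a returned FAQ match (hypothetical schema)."""
    user_question: str
    matched_faq_question: str
    label: str  # assumed labels: "relevant", "irrelevant", "outdated"


def training_pairs(feedback_items):
    """Keep (user question, FAQ question) pairs that users confirmed as relevant."""
    return [
        (f.user_question, f.matched_faq_question)
        for f in feedback_items
        if f.label == "relevant"
    ]


def outdated_counts(feedback_items):
    """Count 'outdated' reports per FAQ question as a hint to re-check the scraper."""
    return Counter(
        f.matched_faq_question for f in feedback_items if f.label == "outdated"
    )


# Made-up feedback items for illustration only.
items = [
    Feedback("how do i clean my hands", "How should I wash my hands?", "relevant"),
    Feedback("can my dog get it", "How should I wash my hands?", "irrelevant"),
    Feedback("is the border open", "Which travel restrictions apply?", "outdated"),
    Feedback("can i fly home", "Which travel restrictions apply?", "outdated"),
]
pairs = training_pairs(items)
stale = outdated_counts(items)
print(pairs)
print(stale.most_common(1))
```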

I love what you're doing on this. I guess one of the things I wanted to ask is maybe as people focused on these technologies and we're doing this day to day in our life and the efforts that we're engaged in in our own projects are built on these data sets that we have. And yet as we look at this crisis and we're looking at the fact that the data set may or may not have everything you need, which you kind of alluded to there, in terms of how applicable it is and getting that feedback, what kind of strategies are you thinking of in terms of being able to provide the outputs that maybe some users are needing if they're not in the CORD-19 data set inherently? Is that just a limit?

Is that a hard limit of what you can tackle or have you thought about how to extend beyond the limitations of the data set you're working with? So exactly, this is the second stage, where we have a more extractive QA that takes some unstructured text database, like the CORD-19 data set, for example, and then extracts answers to questions. We think that this will be relevant to researchers, but we could also envision that more text that the general public would be interested in could be searchable with this system. The only problem there is that these extractive QA mechanisms are incredibly hard to scale to a huge amount of users.

So we would possibly do this for the general public in more like an offline way where we collect questions and if a lot of questions come up that cannot be answered, we might need to use these extractive QA models to answer them from different data sources. Yeah, so just to kind of follow up on that, I think that it's worth noting here that I think it's really cool how you approach this because there is existing sort of question and answer data out there that's from a trusted source. So let's say FAQ pages from the World Health Organization or something like that. So in the first case where you're talking about doing this matching with the transformer models, you're actually matching a user query to a trusted source answer for that question because it was posted on an FAQ site.

But then in the second piece, what you're just talking about in terms of extractive QA, really now what we're talking about is saying, okay, well, and correct me if I'm wrong, but I think the goal here would be to say, well, the user isn't asking exactly what's on the FAQ site or maybe the user isn't sort of general public user, but they're a research user. And like you said, Tony, they wanna know a very specific question. It's not on like a FAQ site, but it is in some research article or it is on some trusted source page. So this is like totally unstructured data.

It's just like an article. And you wanna ask some random question about that article. And that's where this extractive QA comes in. Tony, could you maybe comment a little bit about that and how this sort of extractive QA model is maybe different from the sort of embedding matching that we're talking about in the other case?

Yeah, sure. So I mean, the way these models typically work, they talk about a language model. I usually think of it, it's learning the statistics of a language. So it's effectively like learning I before E except after C or learning that if the way I kind of go over it is, if the Dewey decimal system is this random alpha-numeric number system, but you give it to any librarian they're able to take very different books and be able to basically place them correctly in a library and in order to kind of structure.

And so a lot of these models are doing the same thing. They're basically just looking at patterns of word co-occurrences and statistics of how words occur. And what we're trying to do here is for the models that have already been trained, they're usually trained on things like Wikipedia or English language, German language, text, which is great, but for these sorts of things you really want to get into the domain specific terms. So the medical terms, the genomics terms, more difficult and more infrequent terms that won't be showing up in Wikipedia and trying to learn the statistics of that dataset.
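The phrase "statistics of how words occur" can be illustrated with the simplest possible case: counting which word follows which. The Python sketch below is only that, a bigram counter over a made-up three-sentence corpus. BERT and similar models learn far richer statistics with a neural network, so treat this as an analogy for the idea rather than a description of how those models are built.

```python
from collections import Counter, defaultdict


def bigram_counts(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def next_word_probability(counts, current_word, next_word):
    """Estimate P(next_word | current_word) from the raw counts."""
    total = sum(counts[current_word].values())
    if total == 0:
        return 0.0
    return counts[current_word][next_word] / total


# Made-up sentences for illustration only.
corpus = [
    "the virus spreads quickly",
    "the virus binds the receptor",
    "the protein binds the receptor",
]
counts = bigram_counts(corpus)
print(next_word_probability(counts, "virus", "binds"))
print(next_word_probability(counts, "the", "virus"))
```

A word pair that never appears in the corpus gets probability zero here, which is the toy version of the domain problem described above: a model that has only seen general text has no statistics for rare medical terms.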

So they have existing things like SciBERT and BioBERT, which are BERT models that are built using things like the BioASQ dataset. And so, you know, Timo and I kind of had talked and we said, what if we took the CORD-19 dataset, which is supposedly, we've looked at it now, there are mostly coronavirus articles, but there are lots of other articles as well in there. So the first, you know, rule of data science, the data's always gonna be dirty coming in. So what we did is we said, I'm an MD and Intel has a lot of contacts.

And so I contacted people from the American Medical Association and people that I knew and just basically put out a call and said, hey, could we get some domain experts, physicians, nurses, PhDs and biosciences, people that are probably, in some cases, sitting at home. I heard on the radio that third and fourth year medical students, you know, are being told to kind of stay home. And I was a third year medical student. I mean, I know how difficult it is.

You wanna be there. You wanna be doing something. You're incredibly intelligent and you have all of these skills that you've spent the last two years doing. So I just put out a call and we set up a Slack channel and Timo's group had, you know, DeepSet had this annotation server.

So we put up this CORD dataset and it's essentially the Slack channel allows, I think we've got like 24 on the Slack channel right now. We just started yesterday on the annotation and right now we've got over 100 question answers off the dataset just in the first day. And so these are things like, you know, I'll read you some of the ones I'm looking at now from the website, you know, how many amino acids are in the SARS-CoV protein? And the answer is 76 amino acids.

And so this is something where, you know, Wikipedia is not gonna be able to get you there. These are directly from the articles. It has a link back to the article that you're talking about, things like what does the SARS-CoV protein activate? The NLRP3 inflammasome.

Again, very, very detailed kind of question and answer things that are either specific to viruses or specific to epidemiology or specific to SARS itself or COVID or MERS or any of these kind of similar pandemics that we've seen. And the nice thing for the domain experts is they just log into a website, all you need is a web browser and internet connection. And as long as they can highlight some text, they're good to go. And so we just throw them at it, give them, you know, some walkthrough videos and let them go in and annotate.
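An annotation of this kind boils down to a small record: the passage, the question, the highlighted answer text, and where that text starts in the passage. The Python sketch below builds such a record with field names that loosely follow the SQuAD layout (`context`, `qas`, `answers`, `answer_start`). The helper name is hypothetical, the passage is an invented placeholder rather than a quote from any article, and the actual export format of the annotation tool is not described in the episode.

```python
def make_annotation(context, highlighted_text, question):
    """Build a SQuAD-style record from a highlighted span and a question."""
    start = context.find(highlighted_text)
    if start == -1:
        raise ValueError("highlighted text must appear verbatim in the context")
    return {
        "context": context,
        "qas": [
            {
                "question": question,
                # The answer is stored as text plus its character offset.
                "answers": [{"text": highlighted_text, "answer_start": start}],
            }
        ],
    }


# Placeholder passage written for this example, not taken from a paper.
passage = "The protein described in this placeholder passage consists of 76 amino acids."
record = make_annotation(
    passage,
    "76 amino acids",
    "How many amino acids are in the SARS-CoV protein?",
)
answer = record["qas"][0]["answers"][0]
print(answer)
```

Storing the character offset alongside the text is what lets a tool link an answer back to its exact place in the source article.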

I'm kind of curious as you're talking about the specifics of amino acids and such. As the, I know that China had done a complete genome of this virus fairly early on and published it. Have you been using that? Has that been helpful?

Has that informed any of the work that you guys have been doing in the project here? Not for us. So what we've got so far are just the published articles that were on, like, PubMed that were pushed into the CORD-19 data set, but it's certainly something that we could add into things. So the mechanics of how these question and answer systems work is that the annotator kind of goes backwards from the model.

The annotator reads through the article. So this is a published, you know, peer-reviewed article. And the annotator comes across a certain fact and they say that's interesting. That's something, you know, that's specific that might be interesting.

So they highlight it. And they click question and they make up a question based off of the highlighted text in the article. So if they had the genomic sequence or something like that, they could certainly, you know, if that were in the text of the article, that's something that they could definitely highlight and make up a question for. And then the great thing about this is, you know, on Timo's side, when he creates an extractive AI model for it, it could actually extrapolate and then say, okay, I understand the context and the statistics.

And so if you threw it a new question that wasn't something that the annotators had ever come up with, it should be able to do a pretty decent job at kind of figuring out what it's looking for in that article. So if I put up a brand new article, if I put up a genomics article to this website and ask some questions, it should know where to look in the text. And what's coming back is that the model is not just making up words; it's highlighting the text and saying, here's the highlighted answer in the text. Does this seem right to you?
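To make the annotation flow described above concrete, here is a minimal sketch of how one highlighted answer could be stored as a character span inside its passage. The layout loosely follows the SQuAD convention of keeping an answer start offset; the passage, field names, and helper functions are illustrative assumptions, not the format used by the actual annotation server.

```python
# Hypothetical sketch: one annotation stored as a span inside its passage.
# Field names and the passage are made up for illustration.

def make_annotation(context: str, highlighted: str, question: str) -> dict:
    """Record a question together with the character span of the highlighted answer."""
    start = context.find(highlighted)
    if start == -1:
        raise ValueError("highlighted text must appear verbatim in the context")
    return {
        "question": question,
        "answer_text": highlighted,
        "answer_start": start,                    # character offset into the context
        "answer_end": start + len(highlighted),   # exclusive end offset
    }


def highlight(context: str, annotation: dict) -> str:
    """Recover the highlighted text from the stored offsets."""
    return context[annotation["answer_start"]:annotation["answer_end"]]


# A made-up sentence, not a quote from any CORD-19 article.
context = "The protein described in this made-up passage is 76 amino acids long."
annotation = make_annotation(
    context,
    highlighted="76 amino acids",
    question="How many amino acids are in the protein?",
)
print(highlight(context, annotation))
```

Because the answer is stored as offsets rather than free text, an extractive model trained on such records can only point back into the article, which matches the "highlighting, not making up words" behaviour described above.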

We kind of talked about the QA annotation that Tony has helped spin up, really utilizing that expert input from doctors, from medical students, from medical professionals on the CORD-19 data set. I was curious to kind of push that back to you, Timo, and see what your thoughts are. Let's say that Tony is able to get all this annotation in place, and it sounds like there's a great start on that. How do you see that being integrated into the COVID-QA system itself? And maybe how do you see the two sides of the COVID-QA system developing?

Yeah, very good question. So I would say it starts with the scale of data that is needed. So for this question matching, we can use already pre-existing external data sets. And then these matched questions, maybe one or two thousand per language.

This is enough to get a really good matching system already going. But for this extractive question answering, there's much, much more help needed. The scales there are like these common data sets: in SQuAD from Stanford, around 150,000 question-answer pairs are there.

And that's why we think it's really great to have external help, because this will be the largest lever: getting this data in a format that we can then use in the frameworks. As the framework, we actually use Haystack, which is basically enabling question answering on a larger scale. So normally, if you just use a language model like BERT, you basically take a small paragraph, like maybe two or three thousand characters, and you ask a question and you compute this. But for a large document base, this would be very infeasible.

And then you need a two-stage process. In the first stage, you pre-select documents that could be relevant with a very cheap and very fast solution. And then you apply the more powerful models to just those. And this is definitely not a new invention.

For example, there's a framework out from Facebook. It's called DrQA. This is exactly that: there's a retriever and then a reader architecture.
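The two-stage retriever and reader idea can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the retriever here is simple term overlap, and the "reader" is a stand-in that picks the best-matching sentence, where a real system would run a BERT-style span predictor. None of the function names below come from Haystack or DrQA, and the documents are toy text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, documents, top_k=2):
    """Stage 1: cheap and fast. Score each document by question-term counts."""
    q_terms = set(tokenize(question))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        scored.append((sum(counts[t] for t in q_terms), doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]


def read(question, text):
    """Stage 2 stand-in: return the sentence sharing the most terms with the
    question. A real reader would be a neural model predicting an answer span."""
    q_terms = set(tokenize(question))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    best = max(sentences, key=lambda s: len(q_terms & set(tokenize(s))))
    return best, len(q_terms & set(tokenize(best)))


def answer(question, documents, top_k=2):
    """Run the expensive reader only on the documents the retriever kept."""
    candidates = []
    for doc_id in retrieve(question, documents, top_k):
        sentence, score = read(question, documents[doc_id])
        candidates.append((score, doc_id, sentence))
    if not candidates:
        return None
    score, doc_id, sentence = max(candidates, key=lambda c: c[0])
    return {"doc_id": doc_id, "answer": sentence}


# Toy documents, invented for this example.
docs = {
    "paper_a": "Coronaviruses are enveloped RNA viruses. "
               "The incubation period in this toy example is five days.",
    "paper_b": "Hand washing reduces transmission. Masks are discussed in a later section.",
    "paper_c": "This paper covers protein folding. It does not discuss epidemiology.",
}
result = answer("What is the incubation period?", docs)
print(result)
```

The point of the split is cost: the first stage touches every document but does almost no work per document, while the second stage does the heavy computation on only a handful of candidates.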

But Haystack is doing it in a bit more modular and modern way, with a BERT-based extractive question answering system. And we think there's a huge gain in performance. And so we can take these labels that Tony and the collaborators produce and stick them into frameworks to train end-to-end systems that answer questions on this large CORD-19 data set. So I'm curious.

You mentioned the scale of the data. As SIL's been working to get translations in place over the past days, definitely translations in the thousands seem within reach. Annotations in the hundreds of thousands is definitely a tough thing, especially when you're relying on experts. But I was wondering if you could speak to, I know, some of these domain-adapted models, so SciBERT or other ones.

Do I have it right that those are transfer learned from another model? So if you have a model trained on the SQuAD data set for question answering, which is totally general domain, is it possible to then transfer learn a domain-adapted question answering model with the data that Tony's working on? So it's a little bit different. You have to separate out the base language model, which can just transform text into vectors.

And then you have to take this language model and adjust it to suit your task, for example, document classification or extraction of named entities like persons or cities. And also for question answering, we have to attach a prediction head, so another small neural network on top of this language model, and then train this whole joint network on the target task. And models like SciBERT or BioBERT, they are just pre-trained on a large biomedical corpus.

What we are also doing right now is we took a BERT trained on English data and adjusted it to the domain, because this process of adjusting the language model to a domain is not that computationally expensive compared to, for example, training the whole network from scratch. So we took this network and adjusted it to the CORD-19 dataset. And if we take this adjusted CORD-19 BERT, let's call it like that, and stick in the data, the labels, the question-answer labels, it will hopefully perform better than just a plain BERT.
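For extractive question answering, the small prediction head mentioned above typically produces two scores per token: how likely the token is to start the answer and how likely it is to end it. A minimal sketch of the final step, picking the best span from those scores, might look like this. The scores are hand-written stand-ins for what a trained head would output, and `best_span` is a hypothetical helper, not a function from any library.

```python
def best_span(start_scores, end_scores, max_len=5):
    """Return (start, end, score) maximizing start_scores[s] + end_scores[e]
    subject to s <= e < s + max_len."""
    best = None
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            total = s_score + end_scores[e]
            if best is None or total > best[2]:
                best = (s, e, total)
    return best


# Toy tokens with made-up scores; a real model would compute these.
tokens = ["the", "protein", "has", "76", "amino", "acids", "in", "total"]
start_scores = [0.1, 0.2, 0.0, 3.0, 0.5, 0.1, 0.0, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.4, 0.6, 2.5, 0.2, 0.1]

s, e, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[s:e + 1])
print(answer)
```

Domain adaptation changes the language model underneath, and the annotated question-answer labels train the head on top; the span selection step itself stays the same.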

So I guess how would it be useful? Would it be useful to get more annotators involved in this? And if so, what types of skills do you need with annotators to make it useful for them to apply annotations to the data set? Who can do that and how many more people would be helpful?

As many more as possible. That would be the simple answer. Well, we're primarily looking for people that are master's level or above in the biomedical sciences. So you don't have to have a PhD to do this.

But I would like to see someone who is comfortable in reading an academic paper and being able to explain it to someone else or be able to point out the salient points. Some of the other things that are useful, even if you're not one of those expert annotators or you're not sure if you're an expert annotator, are just proofreaders. So we've got in the Slack channel a sub-channel called Second Opinion, just like the medical jargon. And so Second Opinion is where we have somebody that just is looking through the current answers.

They look at a current question and answer, going, hmm, I wonder if that's quite right, because that doesn't seem to make sense to me. And so they'll put it up and say, hey, question and answer 234 is kind of weird. Can I talk to the person who annotated that? Or can I talk to somebody else who might be able to give us a yay or nay, whether that's a good annotation or not?

So things like that are always useful. And I'm getting good response so far, but I'd love to always get more as we talked in the beginning, just people that are kind of at home trying to figure out things to do. Again, we've got geneticists who might not be doing anything right now. We've got people with biochemistry degrees that maybe they're not doing anything right now, or maybe they're grad students.

We would love to talk to them and try to onboard them for this. If you have an internet connection and know how to use a web browser, you're set. I have a daughter who's a third year med student. So I'm definitely going to.

Oh, perfect. She can contact me. Absolutely. We'll get her going today.

I'm definitely going to bring it to her attention. We made it to 25, at least, on this. Excellent. We made it to 25.

And what's the best way for those sorts of people to contact the effort and get onboarded? Is it the best way through the GitHub repo? Or is there another way to do that? Yeah.

So for the programmer, it's the GitHub repo. And then for the domain expert, it's probably going to be the Slack channel. And we can put the Slack channel up on the site. And we're keeping those two communities separated, so one doesn't get freaked out by the other.

So we'll keep the coders on one side and the biologists on the other side. And we can also add that into the show notes, which I think is what Daniel's about to say there. It makes it easy for them to slide through to this. Yeah, to click on it.

Yeah. Yeah. We'll definitely get the Slack team added to the show notes. So if you're listening and you want to be involved in the annotation, take a look at those show notes and make sure you reach out.

I was curious, circling back. So that's a great way to contribute from the research user side of things. In terms of scaling up the general public use case of COVID-QA, it sounds like there's definitely still some needs there around language support. And maybe, Timo, you could mention what would be best to add there, whether it's scraping more information, more FAQs, or adding them.

And then also on the development side, what are your biggest needs right now in terms of making COVID-QA really useful for the general public? What are some needs that you have, whether that's front end development, maybe not even AI related, or maybe it is AI related? I think it's really great that we are separating the labeling process and the development process, because it will get super complicated. And I also wanted to thank Tony that he's taking a lot of initiative in supervising and pushing the labeling process, because I think I've never heard of an open source labeling process. And I think this will be a mess at some point and it will be very complicated to interact and also to supervise the quality of the labels.

But I think this is the great part and the strength; we just have to try and have to make it work. So this is really great. On the development part, for computer scientists of any sort, there's a lot of help needed. So this is just a hackathon project.

It's not like a full-fledged professional industry solution. So we need a lot of help. I'm in contact actually with a data scientist, also from Intel, who is a contact of Tony's. And she's working on an intelligent scraper.

So right now we have very manual scraping processes for each page, adjusted to its HTML structure. And we would need a more intelligent scraper that you can just point to an FAQ page and it automatically extracts the questions and the answers. Of course, there will be errors, but I mean this is a little bit unavoidable. Maybe we can do a review process afterwards.
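As a rough illustration of what a page-agnostic FAQ scraper could start from, here is a minimal sketch using only Python's standard `html.parser`. The heuristic is an assumption made for this example (any text block ending in a question mark is treated as a question, and the blocks after it as its answer); it is not the scraper being built for the project, and it will make exactly the kind of errors a review step would need to catch.

```python
from html.parser import HTMLParser


class FAQExtractor(HTMLParser):
    """Heuristic: a text block ending in '?' is a question; the text blocks
    that follow it, until the next question, form its answer."""

    BLOCK_TAGS = {"h1", "h2", "h3", "h4", "p", "li", "dt", "dd", "div"}

    def __init__(self):
        super().__init__()
        self.pairs = []      # list of [question, answer]
        self._buffer = []    # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def _flush(self):
        # Join the pieces and normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if not text:
            return
        if text.endswith("?"):
            self.pairs.append([text, ""])
        elif self.pairs:
            self.pairs[-1][1] = (self.pairs[-1][1] + " " + text).strip()


def extract_faq(html):
    """Return (question, answer) tuples found in an HTML string."""
    parser = FAQExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return [(q, a) for q, a in parser.pairs if a]


# Toy page, invented for this example.
page = """
<h1>Frequently asked questions</h1>
<h3>What are the symptoms?</h3><p>Fever and <b>cough</b> are common.</p>
<h3>How does it spread?</h3><p>Mainly through droplets.</p><p>See the guidance page.</p>
"""
faq_pairs = extract_faq(page)
for question, answer in faq_pairs:
    print(question, "->", answer)
```

The same function runs unchanged on pages with different tag layouts, which is the property the manual per-page scrapers lack, at the price of being wrong whenever a page does not phrase its questions with a question mark.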

So this intelligent scraping will be extremely helpful. And then you also mentioned bringing this question matching to other languages. This is something that is personally very important to me because I think this will create the biggest societal impact. And there's a lot to do, because for now we have the question matching algorithm with sentence transformers and BERT just implemented for English. But making this work for other languages with a multilingual language model, for example with the cross-lingual language model open-sourced by Facebook, would improve the experience a lot.
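The question matching idea, comparing a user's question against known FAQ questions and returning the stored answer of the closest one, can be sketched as follows. To keep the example self-contained, a bag-of-words count vector stands in for the sentence embedding; the system discussed here uses sentence transformers and BERT instead. The FAQ entries, threshold, and function names are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in 'embedding': a bag-of-words count vector.
    A real system would use a sentence embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def match_question(user_question, faq, threshold=0.3):
    """Return (matched question, answer, score), or None if nothing is close."""
    query = embed(user_question)
    best_q = max(faq, key=lambda q: cosine(query, embed(q)))
    score = cosine(query, embed(best_q))
    if score < threshold:
        return None
    return best_q, faq[best_q], score


# Placeholder FAQ entries, invented for this example.
faq = {
    "What are the symptoms of COVID-19?": "Placeholder answer about symptoms.",
    "How does the virus spread?": "Placeholder answer about transmission.",
}
match = match_question("what symptoms does covid-19 have", faq)
print(match)
```

With word counts, the match only works when the wording overlaps within one language. Swapping `embed` for a multilingual sentence embedding is what would let a question in one language match an FAQ entry written in another.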

That's on the modeling side, and then there's also a huge amount of help we need on getting this actually to people. After the hackathon, we got contacted by a person. I don't even know his or her personal name. It's Apache 64, and this person just programmed a Telegram integration.

So this service has an API where it can match questions, and you can call this API, and he integrated this into Telegram, and this bot is just working. Also, we integrated a feedback mechanism to feed the user information back into our system. So this help is really appreciated. But what I think could be important there would be maybe a WhatsApp integration and, if we really extend this to low-resource languages where people might not have access to mobile phones with internet, maybe a text message interaction. But this would be a little bit further away, I would say.

So I guess as we get toward the end here, I want to ask a question and I'd like each of you to give me your perspective, since you're coming from two different places on it. As we look, we're in this global crisis, which is unique and has stressed all of us and forced us to think creatively in ways that we have just never done before. It's sort of like living in a science fiction novel to some degree. And so as you guys are looking at the role of artificial intelligence within the world, and suddenly we have this crisis upon us, how do you see artificial intelligence technologies and data technologies impacting our way through this crisis at large, not just the project that you're in, but its role in the larger world? How has your perspective possibly changed over the last few weeks with regard to that?

And what opportunities do you see as the most exciting in terms of the path forward now that you are involved in this and seeing the results that you are? Timo, you want to go first? Yes, totally. So I would say this will enlarge the way corporates contribute to a solution that is helpful for everybody.

For example, DeepMind announced quite early a solution where they basically analyze the molecular structure with reinforcement learning. And through interdisciplinary collaboration that is now made much, much more possible, less bureaucratic and very fast and agile, I think a lot of great solutions can emerge. And I also think that a lot of corporations actually give their employees some dedicated time to work on these solutions. So it's like a collective effort of everybody around the world to work on something that is not directly related to making profits but to solving this crisis.

And I think this is something unique as you said. It's like a unique situation. And yeah, it will hopefully make people collaborate a little bit closer on things that are relevant for society. Yeah, so I loved what Timo said about trying to do things that are relevant to society.

I'm not an official Intel speaker, but I can tell you that we're very creative people that are allowed to do a lot of things, kind of what interests us, in addition to our usual web-based builds. This has actually been kind of interesting in that now the things that are interesting to the whole company, in fact the whole world basically, all the extra stuff is now going toward what can I actually do? In terms of the AI stuff, I kind of go back to we're sheltering in place and we're trying to get through kind of the scale without being connected and somehow AI is kind of helping us to get through the mountain of data that's coming in and trying to maybe focus us a little bit better. I mean, it's designed to be a tool.

It's not designed to replace anything. It's designed to be a really, really nice way of sharpening the edge to figure out exactly what we want to do and what's possible. So that's where I see the AI coming in. Awesome.

Well, we appreciate you both taking time to join us today. I know that especially during this time there's a lot to work on. So thank you for taking time. And definitely if you're listening out there and you are wanting to contribute in a positive way, using your development skills, using your AI skills, using your health knowledge and your medical expertise, please check out this project.

The links are in the show notes and reach out as well. If you're having trouble figuring out how to get connected, there's our Slack team as well, which you can find at changelog.com slash community. And we're happy to get you connected to Timo and or Tony and their team. So just make sure you get connected and contribute.

And thank you both Timo and Tony for joining us. Really appreciate it. Thanks, Dan. Thanks, Chris.

Yeah. Be safe, stay healthy. Thanks for inviting us. Yeah.

Thank you for listening to this episode of Practical AI. More like this at changelog.com slash Practical AI. There you'll find our latest as well as lists of our most popular episodes and the ones we recommend. If this show has helped you on your AI journey, please leave us a five star review on Apple Podcasts, heart us on Spotify, star us on Overcast, and tell a friend what they're missing out on. Practical AI is hosted by Daniel Whitenack and Chris Benson, and is produced by me, Jerod Santo.

And our music is brought to you by the beat, Breakmaster Cylinder. We have awesome sponsors. Please support them. They support us.

Thanks again to Fastly, Linode, and Rollbar. That's all for now. We'll talk to you next time.

Frequently Asked Questions

How long is this episode of Changelog Master Feed?

This episode is 54 minutes long.

When was this Changelog Master Feed episode published?

This episode was published on April 6, 2020.
