Exploring the COVID-19 Open Research Dataset

What this episode covers

In the midst of the COVID-19 pandemic, Daniel and Chris have a timely conversation with Lucy Lu Wang of the Allen Institute for Artificial Intelligence about the COVID-19 Open Research Dataset (CORD-19). She relates how CORD-19 was created and organized, and how researchers around the world are currently using the data to answer important COVID-19 questions that will help the world through this ongoing crisis.

Sponsors:
Linode: Our cloud of choice and the home of Changelog.com. Deploy a fast, efficient, native SSD cloud server for only $5/month. Get 4 months free using the code changelog2019 OR changelog2020. To learn more and get started head to linode.com/changelog.
AI Classroom: An immersive, 3 day virtual training in AI with Practical AI co-host Daniel Whitenack. Get 10% off using the code PRACTICALAI10. To learn more and purchase tickets go to datadan.io.

Featuring:
Lucy Lu Wang
Chris Benson
Daniel Whitenack

Show Notes:
Lucy Lu Wang - Google Scholar
Kaggle: COVID-19 Open Research Dataset Challenge (CORD-19)
CORD-19 Explorer
Semantic Scholar | CORD-19
Allen Institute for AI

TRANSCRIPT · AUTO-GENERATED

The scientific engine has really spun up to handle this current situation. As far as I know, there have been more than 4,000 papers released since January on COVID-19. Wow. And the number of papers continues to grow, but more importantly, the number of papers released every day continues to grow.

So we're up to maybe more than several hundred new papers a day. Bandwidth for Changelog is provided by Fastly. Learn more at fastly.com. We move fast and fix things here at Changelog because of Rollbar.

Check them out at rollbar.com. And we're hosted on Linode cloud servers at linode.com/changelog. Do not underestimate the power of the independent open cloud for developers. Yes, I'm talking about Linode. Linode is our cloud of choice and it's the home of Changelog.com.

What we love most about Linode is their independence and their commitment to open cloud. Open cloud means being unencumbered by outside investment and maximizing value for the community, not shareholders. And that's exactly what Linode represents. No vendor lock-in, open at every layer.

If you want to learn more, head to linode.com/open. Again, linode.com/open. Welcome to Practical AI, a weekly podcast that makes artificial intelligence practical, productive and accessible to everyone. This is where conversations around AI, machine learning and data science happen. Join the community and Slack with us around various topics of the show at changelog.com/community and follow us on Twitter. We're at @PracticalAIFM.

Okay, take it away guys. Welcome to another episode of Practical AI. This is Daniel Whitenack. I'm a data scientist with SIL International and I'm joined as always by my co-host Chris Benson, who is a principal AI strategist at Lockheed Martin.

How are things down in Atlanta, Chris? Doing very well down in Atlanta. Got a bit of a cold, so I may cough my way through the episode, but other than that, doing great. Spring is sprung, it's beautiful.

Hopefully just a cold. Yeah, we're just crossing our fingers. I'm hoping it's just a cold.

Took my daughter to a pediatrician a couple of weeks ago. We actually had to go in because it kept going and it was a frightening thing to say, well, it could be strep, could be this or it could be COVID-19. We can't exclude that. And as a parent, that was like, whoa, so I've gotten through that.

Just a cold. We're good. Good. Good.

Good. Rolling forward. Well, we're surviving here in lockdown in Indiana. It's actually pretty nice outside.

It's mushroom season here. So there's these wild mushrooms that come out in Indiana just around this time. They're called Morel mushrooms and we go every year hunting. So my family has some property that's all forested.

No one else is there. So we've found some good times just going out there and walking through the forest and getting outside. So that's been nice. Yeah, that sounds nice.

Whether you find any mushrooms or not. Exactly. Well, I guess a related topic because it's really the only topic these days affecting us in all of our lives and in a big way. We had another episode a couple of weeks ago about the COVID-QA system, which is a question answering system related to COVID-19.

And they were also using this data set called CORD-19. And today we've got Lucy Lu Wang from the Allen Institute for AI. She's a research scientist there. And we're going to be talking all about the CORD-19 data set, the ins and outs and the story behind it.

So welcome, Lucy. Hi, thank you, Daniel and Chris for having me on the show. Yeah, it's great to have you here. I appreciate you joining us.

This is of course a big topic, everyone on Twitter and all around is talking about this data set and how it's being used. So we're really excited to talk about it a little bit more here. But before we do that, I'd love to hear a little bit about your background, how you got into AI related things and ended up at the Allen Institute. Sure.

Yeah. So I guess my background is maybe a little less traditional. I started out more in kind of biomedical engineering and physics and worked in a host of biomedical startup companies on creating medical devices. And over time, I was doing more simulations and incorporating more data science and machine learning techniques into my work and found that that was what was very motivating for me.

So I decided to pursue a PhD in biomedical informatics where I focused primarily on biomedical applications of natural language processing techniques and creating models to try to connect these automated methods with the type of improvements in clinical care and biomedical text mining that we so desperately need these days. Yeah. So when you're talking about NLP for biomedical applications, are we talking mostly here about medical records and doctors' notes or whatever that is and trying to extract relevant information from those and patterns and mine those for useful things? Is that the main sort of drive there?

That's definitely one aspect of things. I am also very interested in looking into the scientific literature and trying to extract entities and relationships and useful information out of that body of work. And I think that's what really I'm working on at the Allen Institute for AI or AI2. I'm part of a team called Semantic Scholar, which I think a couple weeks back you had an episode about Semantic Scholar.

And it's a literature search engine project. And for Semantic Scholar, we've indexed 180 million papers. There's a really rich corpus of text to work with. And as part of the research team there, I've created a number of tools and work on a number of projects to understand more about the content of that text.

And that's kind of what brought us to the CORD-19 dataset: we have this underlying infrastructure for processing scientific text. And we were asked to contribute some of that expertise to creating the dataset. Awesome. Yeah.

And I'm curious. So, yeah, I think the scientific engine has really spun up to handle this current situation. As far as I know, there have been more than 4,000 papers released since January on COVID-19. Wow.

And the number of papers continues to grow. But more importantly, the number of papers released every day continues to grow. So we're up to maybe more than several hundred new papers a day. Wow.

And it's kind of intimidating to look at this source of information and kind of see what people are discovering. So given the fact that you have so many new papers coming in every day, are you refreshing the dataset? Is the dataset static at some point in time? Or is it something that you're constantly updating and refreshing?

Yeah. So I guess maybe folks already know what the CORD-19 dataset is, but it's kind of a collection of papers about COVID-19 research, including historic coronavirus research. So we have a collection of historic research, and then we also have all the new research that is being released daily. We update the dataset currently at a weekly cadence, but we are rapidly moving to a daily cadence since there are just so many new papers released every day.

Yeah. So since we kind of went that direction, I was wondering if you could maybe tell a little bit of the story of how this dataset came about. I mean, obviously you have data about scientific literature within Semantic Scholar and you're already doing certain things relating to tracking entities or topics covered in those. How did the idea for CORD-19 come about?

And I know that there's others involved in this too. So there's Allen AI, but there's also, I think, Microsoft and the Chan Zuckerberg Foundation and others. So how did this come about? Yeah.

So the entire project is kind of a coordinated effort by the White House Office of Science and Technology Policy. And I think sometime in early March, a group at Georgetown, the Center for Security and Emerging Technology, Georgetown CSET, reached out to us at Allen AI to help coordinate the release of this dataset along with a couple of different organizations. You mentioned MSR, Microsoft Research, and Chan Zuckerberg. Kaggle was also involved, and the National Library of Medicine, which is part of the NIH.

So all of these groups were going to come together to essentially create this dataset to help, I guess, create text mining and information retrieval tools that could assist medical experts in understanding more of what was going on with the epidemic. And for Allen AI, the way that we got involved is we had recently sort of created a new pipeline to revamp our research corpus, our open research corpus. So we had a pipeline for essentially taking these paper documents, which are traditionally in kind of a PDF format, not very easy for text mining, not very accessible, and converting them into kind of like a structured full text format where you could run these natural language processing models on them more easily. So that's sort of our major contribution to the dataset is the pipeline for both harmonizing the paper metadata that we've collected over the years and also producing these structured full text parses so that we can run our models over that text.
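As a rough illustration of what a "structured full text" parse makes possible, here is a minimal sketch that groups a paper's body paragraphs by section. The sample record is made up, and its field names (`paper_id`, `metadata`, `abstract`, `body_text`, `section`) are assumptions modeled on the JSON layout of the public CORD-19 release, not something described in this episode.

```python
import json

# Hypothetical record: made-up content, with field names assumed from the
# public CORD-19 JSON layout (not taken from the episode).
SAMPLE_PARSE = """
{
  "paper_id": "abc123",
  "metadata": {"title": "An example coronavirus paper"},
  "abstract": [{"text": "We study a coronavirus.", "section": "Abstract"}],
  "body_text": [
    {"text": "Coronaviruses are a family of viruses.", "section": "Introduction"},
    {"text": "We collected samples.", "section": "Methods"},
    {"text": "We ran assays.", "section": "Methods"}
  ]
}
"""


def sections_from_parse(parse: dict) -> dict:
    """Group body paragraphs by section heading, keeping document order."""
    grouped = {}
    for paragraph in parse.get("body_text", []):
        heading = paragraph.get("section", "")
        grouped.setdefault(heading, []).append(paragraph["text"])
    # Join the paragraphs of each section into one string.
    return {heading: " ".join(texts) for heading, texts in grouped.items()}


paper = json.loads(SAMPLE_PARSE)
sections = sections_from_parse(paper)
for heading, text in sections.items():
    print(f"{heading}: {text}")
```

Once the text is in this shape, downstream models can be run per section or per paragraph instead of over raw PDF output.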

And I know one of the big things that we talked about when we talked about Semantic Scholar was the ability to kind of find relevant data that might be buried in the wealth of scientific literature that we have about a certain subject that is of interest. So when you came to CORD-19, there's the extraction of the metadata, the actual content of the paper, but then how do you even go about saying like these are all the papers related to coronavirus? I mean, I know, and I'm a little bit ignorant on this subject, so you'll have to forgive me. I know that coronavirus is kind of a family of things.

It's not just this COVID-19, which is associated with coronavirus, but there's coronavirus associated with the common cold and all these things. So how do you go about saying like this is what we're scoping down our data set to and finding that and along with that, deciding what you're going to exclude, I guess, as well? It's a great question. And I think it's a question with very open answers.

So what we started with were a couple of trusted sources that we knew needed to be included in this data set. And those sources were a collection of papers curated by the World Health Organization on COVID-19 specifically. And we also performed searches over PubMed Central, which is a biomedical paper repository run by the NLM, the National Library of Medicine, as well as these preprint servers, bioRxiv and medRxiv, which were publishing the latest research on COVID-19. And we went out and collected papers from these sources using a set of essentially keyword searches to make sure that they were relevant to either COVID-19 or the family of coronaviruses in general.
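A keyword-based relevance filter of the kind described here could look roughly like the sketch below. The keyword list and the `is_relevant` helper are hypothetical and chosen only for illustration; the actual search terms used to build CORD-19 are not given in this episode.

```python
import re

# Hypothetical keyword list (an assumption for illustration only).
KEYWORDS = ["covid-19", "sars-cov-2", "2019-ncov", "coronavirus", "sars", "mers"]

# Word boundaries keep "mers" from matching inside words like "polymers".
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_relevant(title: str, abstract: str = "") -> bool:
    """Return True if the title or abstract mentions any keyword."""
    return PATTERN.search(f"{title} {abstract}") is not None


candidates = [
    ("Outbreak of MERS in 2015", ""),
    ("A structural study", "We examine the SARS-CoV-2 spike protein."),
    ("Polymers for retail customers", ""),
]
kept = [title for title, abstract in candidates if is_relevant(title, abstract)]
print(kept)
```

A filter like this favors recall over precision, which fits the goal of including historical coronavirus research alongside COVID-19 papers.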

Because I think historical coronaviruses like SARS and MERS are also extremely relevant in the current case. I'm curious, as you reached out and made the data set available and you look across some of your partner websites like Kaggle has a call to action and stuff, and you're trying to get AI practitioners and data scientists to focus on important questions that need answering for this purpose. How do you provide guidance in that way for people who are getting engaged on the data set? Is it something where people just grab it and do whatever they want?

Is there any kind of organization across teams? There's a lot of human factors involved in this. How is that conceived? Yeah, so there were a lot of challenges.

For Kaggle, when we opened the challenge initially, the CORD-19 challenge, there was a set of kind of 10 slightly open-ended, clinically relevant questions which were given to the community. And the engagement at Kaggle, the response we've received, has been absolutely incredible. There have been millions of views on the landing pages, and the data set has been downloaded many thousands of times. There have been lots of teams that have cropped up and self-organized to work on this data set.

I think there's a group called CoronaWhy that's several hundred data scientists and experts who have banded together to work on the CORD-19 data set and other coronavirus data sets. And we really want to just offer support to these community members. So there's a couple of sources of information that we've created to help facilitate these things. So on Kaggle, the forums have been super active.

There have been a lot of people answering questions for each other, including from the organizations that have created these data sets. We've also established a Discourse forum to answer questions specifically about the CORD-19 data set. So that's a great place to get answers. I think for these Kaggle challenges and for these shared tasks, one of the things that we're really trying to do by hosting these shared tasks is to connect ML experts with the medical community and experts who can judge the answers that are being retrieved and extracted by these machine learning experts and see whether they have practical application in the clinic.

So that's been a challenge. What's up? This is Daniel Whitenack, one of your Practical AI co-hosts. And I hope you're enjoying this episode and staying healthy during these crazy times.

I'm working on some pretty cool AI stuff here from my home office, but I've also found that I'm having to get a bit creative and be intentional when it comes to honing my AI skills and virtually connecting with the AI community. If you're in a similar situation or you've been inspired by the practical AI we talk about on the show, I want to invite you to a live online AI training event I'm hosting this May called AI Classroom. In AI Classroom, I'm going to teach you the practical skills I've learned over the years using the latest open source AI technology. You'll learn AI theory along with practical hands-on implementations in both PyTorch and TensorFlow.

And after the training, you'll be able to understand the latest AI models, implement your own models in code, train computer vision and NLP models, create model inference servers, and experiment with state-of-the-art methods like reinforcement learning. AI Classroom is taking place this May. It'll be taking place live and completely online in a high-quality virtual classroom, so no travel is required. There will also be two cohorts with convenient time zones for Eastern and Western hemispheres.

Don't miss out. Tickets and more information are available at datadan.io. That's datadan.io. And Practical AI listeners can use the code PRACTICALAI10 for 10% off.

See you online in AI Classroom. So Lucy, you just brought up something that I think is really interesting, which is the sort of interaction between the AI community and the medical community. And I was actually wondering while you were talking about it, like, okay, this CORD-19 dataset exists. I know I have some AI expertise, but I don't necessarily have a lot of medical expertise outside of knowing that I should wash my hands and these other things, kind of the top five that have been going around.

I guess I was wondering, as you've got more experience with this kind of intersection between the AI community and the medical community, what has that interaction been like in the past? Has there been much overlap between the AI community and medical practitioners? And then secondly, as we enter into this new CORD-19 challenge, has that changed in any sort of way or been rapidly advancing in any sort of way? I've sort of worked on the intersection of these communities for a number of years.

And I think there's a lot of great collaborations going on. I think a lot of folks in the computing community are incredibly motivated by these very practical questions that need to be addressed, ways to improve patient care, ways to help with drug development or vaccine development and questions of this nature. Just for COVID-19 specific initiatives, I can kind of give you two anecdotes for ways that we've had annotators, or I guess medical experts, interact with computing experts.

So for the Kaggle challenge, it seems that what is happening is a lot of people are developing different systems, different information retrieval, different information extraction systems. And those systems need to be reviewed by experts for usefulness. So in Kaggle, there's essentially kind of like an army of medical students and other people who are willing to provide their medical expertise and volunteer their medical expertise who are actually going through and manually reviewing a lot of the extractions that are coming out of these Kaggle challenges and creating these kind of living systematic review pages with the answers to some of these questions. So if you go to the Kaggle page, you can see these reviews kind of being created in real time and updated in real time as new literature is released.

And another thing that I've been involved in lately is we're kind of running a TREC challenge on this data set. So TREC is the Text REtrieval Conference and it's been a project at NIST, which is the National Institute of Standards and Technology, for the last 20 years. And these folks are really good at information retrieval and judging information retrieval systems. And the way that these systems are judged is by having expert medical annotators review all the results and provide gold rankings of what is most relevant to a query and what is the least relevant.
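To make "judging against gold rankings" concrete, here is a minimal sketch of two standard information retrieval metrics, precision at k and nDCG at k, computed from expert relevance judgments. The document IDs, the graded labels, and the choice of metrics are assumptions for illustration; the episode does not say which measures the challenge actually reports.

```python
import math


def precision_at_k(ranked_ids, judgments, k):
    """Fraction of the top-k results that experts judged relevant (grade > 0)."""
    return sum(1 for doc_id in ranked_ids[:k] if judgments.get(doc_id, 0) > 0) / k


def dcg(gains):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, judgments, k):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical expert judgments for one query: 2 = relevant, 1 = partially, 0 = not.
judgments = {"d1": 2, "d2": 0, "d3": 1, "d4": 2}
system_ranking = ["d1", "d2", "d4", "d3"]

print(precision_at_k(system_ranking, judgments, k=2))
print(ndcg_at_k(system_ranking, judgments, k=4))
```

The human effort goes into producing `judgments`; once those exist, any number of retrieval systems can be scored against them automatically.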

So there's a lot of this incorporating experts in the loop, incorporating humans in the loop to kind of bolster our machine learning systems. And that is not something that we're going to be moving away from anytime soon. Yeah. As you're talking here, I'm looking at the various questions there that are listed on Kaggle, the tasks to go answer, and kind of extending this thing about this collaboration between the AI community and the medical community.

The questions themselves, where do they originate from? How are they decided as the important questions that we could all get a shot at going and answering with the data set? How is that? Yeah, I may be wrong, but I believe this set of questions originated from the White House Office of Science and Technology Policy in collaboration with Kaggle.

And you have to understand, with this challenge and the data set during the early days, we literally had just a few days to turn around this data set, put it out there and publish this challenge. We wanted people to start looking at this as quickly as possible. So a lot of the questions that you see on Kaggle right now are very open-ended. They can be interpreted in different ways.

As time has gone on, as we've learned in this last month, actually some of those questions are more useful in clinic. Some of those questions are less useful. Like clinicians already know the answers to some of these questions. So now as we move into the second month of this challenge, there will be a new batch of questions released to kind of motivate new work and questions that have not yet been answered by the community.

Yeah, so I'm curious with that. It seems like you could have various bottlenecks in this situation. And one of those I think you highlighted is this sort of useful interaction between medical practitioners and the AI people that are trying to do something with the data set. So I was wondering, do you have a sort of healthy community of medical practitioners that are very deeply involved in kind of looking at what's coming through Kaggle or these other teams that are kind of self-organizing?

Because I know, having worked at a nonprofit for a bit, one of the things I've seen is people really get behind the social good challenge and they work on it like on a hackathon on the weekend and then the project kind of dies. So what are the ways for people, if I want to step into some CORD-19 related work, what are the ways that I can kind of step into that but also get connected with the right sort of medical people to make sure that what I'm doing is useful and not just like an interesting weekend thing? Yeah, of course.

I think like now that COVID-19 has sort of taken over all of our lives, a lot of people are feeling very motivated to do something in this direction, contribute their skills. And I think I mentioned some groups earlier, groups like CoronaWhy, which is self-organizing to analyze this type of data. And that's a group made up of data scientists, machine learning experts, medical practitioners, and so on. And so it's like a good place to get feedback on one's work.

In Kaggle forums, similarly, I think there are a couple of threads out there, essentially continuing this discussion of how to connect to medical experts, how to verify results and so on. A lot of people have taken it upon themselves to build systems. There's tons of systems out there searching for and extracting information out of CORD-19. And to be perfectly honest, not all of those systems are going to be highly usable or used by any kind of clinical audience.

But we, as a community, need to essentially figure out which of those systems are most promising, figure out where to expend additional energy, like more development time and so on. And that's where our annotators come in. Going back to what you said a moment ago, you put this together so quickly and got it out there, and the whole world has kind of dived in, the whole world of data science at least. And that's very different from how most communities form.

And I'm kind of wondering, with hindsight, totally recognizing that you had no choice, you had to get it out there and you did a fantastic job with the kind of pressure that you were under. And knowing what you know today, what are some of the things around community that you would like to have done, and, going forward, as you're looking at the tasks and how to move us going into the second month, how to move us in the next set of directions that you need people to take, what are some of the ideas that you're planning to implement there to kind of evolve this process? Let me try to unpack that. Yeah, any way you want to, sure.

I think we've learned a ton over the last month speaking with some of our collaborators at Kaggle. Anthony Goldbloom was mentioning how Kaggle has fallen into this place with this challenge where they're kind of in new territory: the type of challenge that the CORD-19 challenge is, is very unlike most of the challenges hosted by Kaggle. And there's a very open-ended nature to it. We're trying to discover answers, but there's no sense of what a gold answer is. And they reacted to that in a kind of wonderful way by essentially harnessing these medical students as a resource to make judgments on people's extractions and kind of putting in effort where it seemed like the results were most useful or were going to be the most useful.

So I guess that's one thing, which is trying to figure out as early as possible where the most useful results are and putting an additional effort there and maybe even like abandoning things that are not worth pursuing. And then I don't even remember what the second half of that question was. Things that you might be thinking going forward, but no worries, we can come back to that. That's right.

I mean, I have lots of thoughts on that as well. So certainly we're going to be supporting CORD-19 for as long as it makes sense to do so, certainly until the epidemic seems to wind down a bit. There have been lots of requests for additional features and additional content. So that is one of our priorities.

Additional content comes in two forms. One is simply providing more faithful parses of the papers. So right now, the first things might be including things like inbound and outbound citations, tables and figures, other places where the answers might be. And these have been requested by a lot of folks.

Another is content in the sense of more papers. So one thing that we've been very grateful for is that a lot of publishers have made their COVID-19 articles open access. And by making them open access, they've allowed us to release a data set like CORD-19. But the fact is, if you look at the data set and at the papers that are being cited by the papers in the data set, there's actually a lot of content that is outside of this direct core set of articles on COVID-19 and coronaviruses that is also very relevant to the content of the data set.

So it would be great if we could work with publishers or they could work with us essentially to provide additional content that could be useful for discovering information about COVID-19 and its treatments. Yeah, so it's like you've amassed this kind of central hub, but from that hub of papers, obviously those papers cite other papers and those papers cite other papers and there's other related work and you can kind of go down a rabbit hole. I know we talked on the Semantic Scholar episode about this sort of graph of relations and that sort of thing as papers cite each other. I mean, there is a wealth of papers already in the data set.

What is the kind of current size of the data set? And you mentioned a few different sources. Could you just give us a sense of kind of the descriptive statistics of the data set at this point, I guess? Sure, probably should have started with that.

So the data set currently consists of more than 50,000 papers and approximately 40,000 of those papers have full-text content available. And these papers, as I mentioned, come from a diversity of sources. There's a list of WHO COVID-19 papers, several hundred of those that have been curated by the WHO. And there are preprints from bioRxiv and medRxiv, which number in the thousands.
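Descriptive statistics like these (total papers, papers with full text, papers per source) can be computed from a metadata table with a few lines of code. The sketch below uses a tiny made-up table; the column names `source` and `has_full_text` are hypothetical and may not match the real metadata file.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; column names are assumptions.
SAMPLE_METADATA = """paper_id,source,has_full_text
p1,PMC,True
p2,PMC,False
p3,biorxiv,True
p4,WHO,False
p5,medrxiv,True
"""


def summarize(csv_text: str) -> dict:
    """Count papers overall, per source, and with full text available."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    by_source = Counter(row["source"] for row in rows)
    with_full_text = sum(1 for row in rows if row["has_full_text"] == "True")
    return {
        "total": len(rows),
        "with_full_text": with_full_text,
        "by_source": dict(by_source),
    }


summary = summarize(SAMPLE_METADATA)
print(summary)
```

Running the same summary on each release would also show how the counts change as the data set is refreshed.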

And actually the vast majority of papers come to us via PubMed Central. And this number will actually continue to grow because many publishers are now depositing all COVID-19 content into PubMed Central. So Lucy, I'm kind of curious. So recognizing that your responsibility has been putting this together and getting it out into the world.

I imagine that you've probably talked with various teams or at least observed some of the efforts and work that's going on. And so I'm kind of curious, what are some of the more interesting or innovative or pick your adjective of the efforts that you've heard of and kind of gone, wow, that's a pretty cool way of approaching this? Any stories to tell us on that? I think folks have done a really great job diving deeply into this data set.

There are so many different kinds of search engines that have cropped up over this data set. If you go to the data set landing page, it's got a list of maybe several dozen that we've heard of and are enumerating there. I'm sure there are many more that we aren't aware of. People are pursuing lots of different technologies for these kinds of search engines.

Some are kind of using the latest state-of-the-art transformer models, neural models for ranking. I think Covidex from Waterloo and NYU is using the latest T5 model, very cool stuff. And some of the search engines are actually using very traditional methods, using Lucene and Elasticsearch, and focusing more on how to search and filter using entities or other paper features. And kind of a funny thing that we heard from Kaggle is that for many of the questions on the CORD-19 challenge, sometimes simpler methods, the more traditional methods, have actually worked better for kind of extracting answers. And this came as a surprise to me.
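For a sense of what the "more traditional methods" look like, here is a minimal from-scratch sketch of BM25, the classic lexical ranking function that Lucene-based engines use by default. It is a toy under stated assumptions: whitespace tokenization, made-up documents, and conventional parameter values (`k1=1.5`, `b=0.75`). It is not the code of any search engine mentioned in the episode.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(doc) for doc in docs]
        self.term_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.n_docs = len(docs)
        self.avg_len = sum(len(t) for t in self.doc_tokens) / self.n_docs
        # Document frequency: number of documents containing each term.
        self.doc_freq = Counter(
            term for tokens in self.doc_tokens for term in set(tokens)
        )

    def idf(self, term):
        n = self.doc_freq.get(term, 0)
        return math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, index):
        freqs = self.term_freqs[index]
        length = len(self.doc_tokens[index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            # Length normalization: long documents are penalized slightly.
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf(term) * tf * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        """Document indices sorted from most to least relevant."""
        scored = [(self.score(query, i), i) for i in range(self.n_docs)]
        return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


docs = [
    "coronavirus risk factors in elderly patients",
    "vaccine candidates for coronavirus",
    "morel mushroom foraging in indiana",
]
index = BM25(docs)
print(index.rank("coronavirus vaccine"))
```

A ranker like this needs no training data, which may be part of why simple methods held up well on a brand new corpus with no labeled answers.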

So I have a follow up on that and that is like for working better, you know, kind of putting that in quotes, what does that mean? Who out there is looking at the various results that are coming back from teams and making those evaluations and kind of, and obviously you've already said that you'll be steering the new tasks on Kaggle learning what we know. Who's making those evaluations and the decisions associated with that to keep everything focused? So I think this is primarily work that has been taken on by organizers at Kaggle and kind of medical students that they've had come in and evaluate some of these answers.

So really like they put in a ton of effort in curating these results. And I think as for like the metrics that we use to judge these results, I think currently like they're mostly kind of information retrieval based metrics of success. Yeah, to give people an idea, I'm on the Kaggle website and just looking at some of the tasks to maybe for those listeners out there that don't have a good idea of the kind of scope of these, there's information or some of the tasks that are listed are what do we know about COVID-19 risk factors? What do we know about vaccines and therapeutics?

What has been published about medical care? And if I kind of dive into each of these, there's a number of submissions, but things like, you know, CORD-19 analysis with sentence embeddings, COVID-19 literature clustering, full-text search of research papers. And so you can kind of already see, and here's another one, BERT SQuAD for semantic corpus search. So you get a sense, I think, of what you're talking about, Lucy, that some people are kind of going after these sort of transformer based, maybe extractive QA sort of things.

Others are maybe using full-text search capabilities that have been out there for a while, like Elasticsearch sort of capabilities like we talked about on the previous episode. Also if I'm looking at the data just to kind of have something in people's mind, I'm seeing, of course you have the categories of source, like bioRxiv and that sort of thing, but if I'm looking at the individual papers, I can see the abstract of the paper and the body text. And if I'm looking at the body text of this one, I'm reading things like to assess the effects of truncation of the poly-C tract on replication, blah, blah, blah, which some of that doesn't mean a lot to me, but this is the sort of data that's in there. I was wondering as a complete newbie to a lot of this medical terminology, what are maybe some good ways, I know you've got this CoViz tool from the Allen Institute, which is helping kind of explore some of the, looks like the genes and cells and diseases and chemicals that are connected throughout the data set.

What are some good ways of kind of onboarding into the CORD-19 work for people that might be sort of new to medical terminology? Are there ways to kind of pick up some of that or explore some of it and form some of those connections in a reasonable sort of way? To domain knowledge? Yeah, I think domain knowledge is one of the greatest barriers for working on this data set.

And thanks for mentioning the CoViz project. That's a tool that was released by Allen AI for exploring this data set in a slightly more meaningful way. And for that tool, we essentially ran models to perform extractions of these entities from the text, entities of different classes like drugs, genes, diseases, phenotypes, things of that nature, and created a visualization to allow you to browse the relationships that are most prevalent between pairs of these entities. So that's a great way to kind of explore what's in the data set.
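The idea of surfacing the most prevalent relationships between pairs of extracted entities can be sketched as a co-occurrence count. This is only an illustration under stated assumptions: the annotations below are hand-written and hypothetical, the function name `cooccurrence_counts` is made up for this example, and counting co-occurrence per document is a simple stand-in, not a description of how the Allen AI tool actually computes its associations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, hand-written entity annotations (entity -> class), one dict per paper.
# These are NOT real model extractions from the data set.
annotated_docs = [
    {"ACE2": "gene", "COVID-19": "disease", "remdesivir": "drug"},
    {"ACE2": "gene", "COVID-19": "disease"},
    {"COVID-19": "disease", "fever": "phenotype", "remdesivir": "drug"},
]


def cooccurrence_counts(docs):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in docs:
        # Sorting gives every pair a canonical order, so (a, b) and (b, a) merge.
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
    return counts


pair_counts = cooccurrence_counts(annotated_docs)

# Pairs seen together most often come first; these could feed a visualization.
most_prevalent = pair_counts.most_common(2)
```

Knowing the class of each entity (drug, gene, disease, phenotype) would let a browser filter these pairs, for example to show only drug and disease pairs.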

There's also just like exploring the articles. We have a CORD-19 Explorer to do that. But in general, I think unless you're willing to spend sort of a couple of years of your life in medical school, it is very hard to understand what some of these terms mean. Certainly knowing what class or category of entity is being mentioned is important.

So knowing something is a protein, knowing something is a receptor, knowing something belongs to a particular biological pathway. These are kind of key for gaining an initial understanding of what is being said in the text. But that's also why we need medical experts to assess the actual utility of some of these extractions. So I'm wondering, I happen to have a daughter who is a third year medical student.

And I've told her very recently because we had the other episode about this and stuff, but she hadn't been aware of it. Is there any need to connect with medical schools? Has anybody kind of taken that on to try to gather those together and stuff? Because it's been obviously there's been an enormous effort in a very short amount of time, totally recognizing the constraints of the reality that we're in today.

Yeah. Is that something that y'all are kind of thinking about in terms of going forward, maybe for stage two or whatever you want to call it? Yeah, absolutely. So your daughter, I don't know if medical school is continuing as usual, but I think during the third and fourth years, you're mostly in clinic.

So I know of a lot of medical schools where there are these kind of more senior medical students who really want to contribute how they can, but really aren't able to be in clinic at this moment. Yeah, they've been kicked out of the ER for, you know, they're working with the health department locally. And I think that kind of alternative work is really common right now for advanced medical students. Yeah, exactly.

So for the TREC task that I was mentioning that we're hosting, we actually are enlisting medical students from a number of institutions, from Oregon Health & Science University and University of Texas and University of Washington, to kind of help with providing annotations on some of these extractions. So I think depending on where your daughter is or where some of these medical students are, there's probably going to be other initiatives like this one that really need their help. So I would definitely encourage anyone to look out for that. I guess I have one more follow up to that that you just mentioned.

Do you think recognizing that and recognizing that we're going to be past this moment at some point, this kind of very unique moment in our history, but just as, you know, the kind of the widespread introduction of like open source software really changed, you know, industry itself from being highly proprietary to being, you know, open source became not only a part of business models, but even underlying part of a lot of commercial software that's out there. And it fundamentally changed how that worked. Do you think this is a moment just because COVID-19 passes us, you know, and we get past this, that maybe there are other challenges, whether they be things that we've been dealing with a long time like cancers or new things that may come, that this may fundamentally change how we attack, you know, really hard medical challenges with AI and that integration with the communities that has happened out of necessity. I'm certainly hoping that to be the case.

So we've definitely seen what people can do when they come together for a month or two, and it's incredible. Like, there's so many people being engaged and building like interesting tools and useful tools. I think there's a couple of maybe things that I'd love for us to be able to extend into the future. One thing is definitely publishers coming together to release more open access content on really kind of important topics such as COVID-19 and then the community coming together, especially crossing boundaries, crossing boundaries between computing, the computing community, the medical community and policy makers to really build something useful.

So I'm curious, a little bit of a follow up, I guess, to that question as well. I know there've been things as I've worked on related work with SIL where I was thinking, oh, if I would have done this prior to this crisis, like I would be able to do something better than what I'm able to do now. In hindsight, it's easy to see those opportunities. I'm curious on your side with your own research and work.

I mean, you were kind of, I'm assuming you were working on various things related to Semantic Scholar prior to this crisis. And now your head's down working on CORD-19 and getting this in shape. What are you interested in exploring in the future? Not necessarily CORD-19 related, but how has this whole process shifted like what you want to work on in your own research in the future?

Yeah. So I think this brings up a slightly earlier point, which is we really became involved in the creation of this data set because we had, as Semantic Scholar, built a bunch of infrastructure for scientific papers, and a collaborator of mine, Kyle Lo, and I had also been working for the past nearly a year on a way of essentially creating a full text extraction pipeline for some of these papers that we are using in CORD-19 today. So a lot of this was infrastructural work. It's not particularly glamorous, but it is really important and it really became more important in light of what happened in the last few months.

I guess one thing is infrastructural improvements can be really important even if it's not particularly sexy. And then kind of going forwards, there's certainly things I care about besides kind of creating data sets of papers. And my research focuses on making scientific literature and making this content more available to biomedical researchers and more understandable. And as you mentioned before, there's so many entities, so many kind of like very domain-specific words and relationships that exist in the biomedical literature.

And even for someone who is a domain expert, some of those terms can be very hard to parse through and understand. So a lot of my ongoing projects are trying to create systems that understand particular types of relationships. For example, those that understand drug interactions or can mine them out of the literature, those that can understand medical images better. These are the types of projects that I am hoping to continue to work on in the future.

Awesome. And I'm hoping to have the links to the main data set website in the show notes along with the Kaggle challenge and the various other projects and groups that you've talked about. Really appreciate you coming on the show and describing a bit more about the data set, how it came about and your own work with it. Really encouraged by the work that the Semantic Scholar team and collaborators are doing.

And thank you for your hard work on this and taking time to talk to us. Yeah, thank you so much for having me. And we really hope that other folks are encouraged to kind of contribute and become involved in this project. Yes, please do.

Thank you for listening to this episode of Practical AI. More like this at changelog.com slash Practical AI. There you'll find our latest as well as lists of our most popular episodes and the ones we recommend. If this show has helped you on your AI journey, please leave us a five star review on Apple Podcasts, part us on Spotify, star us on Overcast, and tell a friend what they're missing out on.

Practical AI is hosted by Daniel Whitenack and Chris Benson, and is produced by me, Jerod Santo. And our music is brought to you by the beat freak, Breakmaster Cylinder. We have awesome sponsors. Please support them.

They support us. Thanks again to Fastly, Linode, and Rollbar. If you and your organization could benefit from speaking directly to all the AI practitioners out there, you should sponsor the show. Podcast advertising is one of the most effective ways to spread your message in an authentic way.

Plus, you get the added bonus of supporting something you love. That's all for now. We'll talk to you next time.

Frequently Asked Questions

How long is this episode of Changelog Master Feed?

This episode is 43 minutes long.

When was this Changelog Master Feed episode published?

This episode was published on April 20, 2020.
