Disseminate: The Computer Science Research Podcast

PODCAST · education

This podcast features interviews with Computer Science researchers. Hosted by Dr. Jack Waudby, each episode highlights the problem(s) the researchers tackled, the solutions they developed, and how their findings can be applied in practice. The podcast is for industry practitioners, researchers, and students, and aims to narrow the gap between research and practice and to make awesome Computer Science research more accessible. We have two types of episode: (i) Cutting Edge (red/blue logo), where we talk to researchers about their latest work, and (ii) High Impact (gold/silver logo), where we talk to researchers about their influential work. You can support the show through Buy Me a Coffee. A donation of $3

  1. 91

    Mateusz Gienieczko | AnyBlox: A Framework for Self-Decoding Datasets | #69

    In this episode of Disseminate: The Computer Science Research Podcast, host Dr. Jack Waudby is joined by Mateusz Gienieczko, PhD researcher at TU Munich and co-author of the VLDB Best Paper Award-winning paper AnyBlox.

    They dive deep into a fundamental problem in modern data systems: why cutting-edge data encodings and file formats rarely make it from research into real-world systems, and how AnyBlox proposes a radical solution.

    Mateusz explains the core idea of self-decoding data, where datasets ship with their own portable, sandboxed decoders, allowing any database system to read any encoding safely and efficiently. Built on WebAssembly, AnyBlox bridges the long-standing gap between database research and practice without sacrificing performance, portability, or security.

    This episode is essential listening for database researchers, data engineers, system builders, and industry practitioners interested in the future of data formats, analytics performance, and making research matter in practice.

    Links:
    - Paper: https://www.vldb.org/pvldb/vol18/p4017-gienieczko.pdf
    - GitHub: https://github.com/AnyBlox
    - Mat's Homepage: https://v0ldek.com/

    Hosted on Acast. See acast.com/privacy for more information.

  2. 90

    Xiangyao Yu | Disaggregation: A New Architecture for Cloud Databases | #68

    In this episode of Disseminate: The Computer Science Research Podcast, host Jack Waudby sits down with Xiangyao Yu (UW–Madison), one of the leading voices shaping the next generation of cloud-native databases.

    We dive deep into disaggregation, the architectural shift transforming how modern data systems are built. Xiangyao breaks down:
    - Why traditional shared-nothing databases struggle in cloud environments
    - How separating compute and storage unlocks elasticity, scalability, and cost efficiency
    - The evolution of disaggregated systems, from Aurora and Snowflake through to advanced pushdown processing and new modular services
    - His team's research on reinventing core protocols like 2-phase commit for cloud-native environments
    - Real-time analytics, HTAP challenges, and the Hermes architecture
    - Where disaggregation goes next: indexing, query optimizers, materialized views, multi-cloud architectures, and more

    Whether you're a database engineer, researcher, or a practitioner building scalable cloud systems, this episode gives a clear, accessible look into the architecture that’s rapidly becoming the default for modern data platforms.

    Links:
    - Xiangyao Yu's Homepage
    - Disaggregation: A New Architecture for Cloud Databases [VLDB'25]

  3. 89

    Navid Eslami | Diva: Dynamic Range Filter for Var-Length Keys and Queries | #67

    In this episode of Disseminate: The Computer Science Research Podcast, Jack sits down with Navid Eslami, PhD researcher at the University of Toronto, to discuss his award-winning paper “DIVA: Dynamic Range Filter for Variable Length Keys and Queries”, which earned Best Research Paper at VLDB.

    Navid breaks down how range filters extend the power of traditional filters for modern databases and storage systems, enabling faster queries, better scalability, and theoretical guarantees. We dive into:
    - How DIVA overcomes the limitations of existing range filters
    - What makes it the “holy grail” of filtering for dynamic data
    - Real-world integration in WiredTiger (the MongoDB storage engine)
    - Future challenges in data distribution smoothing and hybrid filtering

    Whether you're a database engineer, systems researcher, or student exploring data structures, this episode reveals how cutting-edge research can transform how we query, filter, and scale modern data systems.

    Links:
    - Diva: Dynamic Range Filter for Var-Length Keys and Queries [VLDB'25]
    - Diva on GitHub
    - Navid's LinkedIn

  4. 88

    Adaptive Factorization in DuckDB with Paul Groß

    In this episode of the DuckDB in Research series, host Jack Waudby sits down with Paul Groß, PhD student at CWI Amsterdam, to explore his work on adaptive factorization and worst-case optimal joins, techniques that push the boundaries of analytical query performance.

    Paul shares insights from his CIDR'25 paper “Adaptive Factorization Using Linear-Chained Hash Tables”, revealing how decades of database theory meet modern, practical system design in DuckDB. From hash table internals to adaptive query planning, this episode uncovers how research innovations are becoming part of real-world systems.

    Whether you’re a database researcher, engineer, or curious student, you’ll come away with a deeper understanding of query optimization and the realities of systems engineering.

    Links:
    - Adaptive Factorization Using Linear-Chained Hash Tables

  5. 87

    Parachute: Rethinking Query Execution and Bidirectional Information Flow in DuckDB - with Mihail Stoian

    In this episode of the DuckDB in Research series, host Jack Waudby sits down with Mihail Stoian, PhD student at the Data Systems Lab, University of Technology Nuremberg, to unpack the cutting-edge ideas behind Parachute, a new approach to robust query processing and bidirectional information passing in modern analytical databases.

    We explore how Parachute bridges theory and practice, combining concepts from instance-optimal algorithms and semi-join filtering to boost performance in DuckDB, the in-process analytical SQL engine that’s reshaping how research meets real-world data systems.

    Mihail discusses:
    - How Parachute extends semi-join filtering for two-way information flow
    - The challenges of implementing research ideas inside DuckDB
    - Practical performance gains on TPC-H and CEB workloads
    - The future of adaptive query processing and research-driven system design

    Whether you're a database researcher, systems engineer, or curious practitioner, this deep-dive reveals how academic innovation continues to shape modern data infrastructure.

    Links:
    - Parachute: Single-Pass Bi-Directional Information Passing (VLDB 2025 paper)
    - Mihail's homepage
    - Parachute's GitHub repo

  6. 86

    Anarchy in the Database: Abigale Kim on DuckDB and DBMS Extensibility

    In this episode of the DuckDB in Research series, host Jack Waudby talks with Abigale Kim, PhD student at the University of Wisconsin–Madison and author of the VLDB 2025 paper “Anarchy in the Database: A Survey and Evaluation of DBMS Extensibility”. They explore how database extensibility is reshaping modern data systems, and why DuckDB is emerging as the gold standard for safe, flexible, and high-performance extensions. Abigale shares the inside story of her research, the surprises uncovered when testing Postgres and DuckDB extensions, and what’s next for extensibility and composable database design.

    This episode is perfect for researchers, practitioners, and students interested in databases, systems design, and the interplay between academia and industry innovation.

    Highlights:
    - What “extensibility” really means in a DBMS
    - How DuckDB compares to Postgres, MySQL, and Redis
    - The rise of GPU-accelerated DuckDB extensions
    - Why bridging research and engineering matters for the future of databases

    Links:
    - Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility (VLDB 2025)
    - Rethinking Analytical Processing in the GPU Era

    You can find Abigale at:
    - X
    - Bluesky
    - Personal site

  7. 85

    Recursive CTEs, Trampolines, and Teaching Databases with DuckDB - with Prof. Torsten Grust

    In this episode of the DuckDB in Research series, host Dr Jack Waudby talks with Professor Torsten Grust from the University of Tübingen. Torsten is one of the pioneers behind DuckDB’s implementation of recursive CTEs.

    In the episode they unpack:
    - The power of recursive CTEs and how they turn SQL into a full-fledged programming language.
    - The story behind adding recursion to DuckDB, including the USING KEY feature and the trampoline and TTL extensions emerging from Torsten’s lab.
    - How these ideas are transforming research, teaching, and even DuckDB’s internal architecture.
    - Why DuckDB makes databases exciting again, from classroom to cutting-edge systems research.

    If you’re into data systems, query processing, or bridging research and practice, this episode is for you.

    Links:
    - USING KEY in Recursive CTEs
    - How DuckDB is USING KEY to Unlock Recursive Query Performance
    - Trampoline-Style Queries for SQL
    - U Tübingen Advent of code
    - A Fix for the Fixation on Fixpoints
    - One WITH RECURSIVE is Worth Many GOTOs
    - Torsten's homepage
    - Torsten's X

  8. 84

    DuckDB in Research S2 Coming Soon!

    Hey folks! The DuckDB in Research series is back for S2!

    In this season we chat with:
    - Torsten Grust: Recursive CTEs
    - Abigale Kim: Anarchy in the Database
    - Mihail Stoian: Parachute: Single-Pass Bi-Directional Information Passing
    - Paul Groß: Adaptive Factorization Using Linear-Chained Hash Tables

    Whether you're a researcher, engineer, or just curious about the intersection of databases and innovation, we are sure you will love this series.

  9. 83

    Rohan Padhye & Ao Li | Fray: An Efficient General-Purpose Concurrency JVM Testing Platform | #66

    In this episode of Disseminate: The Computer Science Research Podcast, guest host Bogdan Stoica sits down with Ao Li and Rohan Padhye (Carnegie Mellon University) to discuss their OOPSLA 2025 paper: "Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM".

    We dive into:
    - Why concurrency bugs remain so hard to catch, even in "well-tested" Java projects.
    - The design of Fray, a new concurrency testing platform that outperforms prior tools like JPF and rr.
    - Real-world bugs discovered in Apache Kafka, Lucene, and Google Guava.
    - The gap between academic research and industrial practice, and how Fray bridges it.
    - What’s next for concurrency testing: debugging tools, distributed systems, and beyond.

    If you’re a Java developer, systems researcher, or just curious about how to make software more reliable, this conversation is packed with insights on the future of software testing.

    Links & Resources:
    - The Fray paper (OOPSLA 2025)
    - Fray on GitHub
    - Ao Li’s research
    - Rohan Padhye’s research

    Don’t forget to like, subscribe, and hit the 🔔 to stay updated on the latest episodes about cutting-edge computer science research.

    #Java #Concurrency #SoftwareTesting #Fray #OOPSLA2025 #Programming #Debugging #JVM #ComputerScience #ResearchPodcast

  10. 82

    Shrey Tiwari | It's About Time: A Study of Date and Time Bugs in Python Software | #65

    In this episode, Bogdan Stoica, Postdoctoral Research Associate in the SysNet group at the University of Illinois Urbana-Champaign (UIUC), steps in to guest host. Bogdan sits down with Shrey Tiwari, a PhD student in the Software and Societal Systems Department at Carnegie Mellon University and member of the PASTA Lab, advised by Prof. Rohan Padhye. Together, they dive into Shrey’s award-winning research on date and time bugs in open-source Python software, exploring why these issues are so deceptively tricky and how they continue to affect systems we rely on every day.

    The conversation traces Shrey’s journey from industry to research, including formative experiences at Citrix and Microsoft Research, and how those shaped his passion for software reliability. Shrey and Bogdan discuss the surprising complexity of date and time handling, the methodology behind Shrey’s empirical study, and the practical lessons developers can take away to build more robust systems. Along the way, they highlight broader questions about testing, bug detection, and the future role of AI in ensuring software correctness. This episode is a must-listen for anyone interested in debugging, reliability, and the hidden challenges that underpin modern software.

    Links:
    - It’s About Time: An Empirical Study of Date and Time Bugs in Open-Source Python Software 🏆 ACM SIGSOFT Distinguished Paper Award
    - Shrey's homepage

  11. 81

    Lessons Learned from Five Years of Artifact Evaluations at EuroSys | #64

    In this episode we are joined by Thaleia Doudali, Miguel Matos, and Anjo Vahldiek-Oberwagner to delve into five years of experience managing artifact evaluation at the EuroSys conference. They explain the goals and mechanics of artifact evaluation, a voluntary process that encourages reproducibility and reusability in computer systems research by assessing the supporting code, data, and documentation of accepted papers. The conversation outlines the three-tiered badge system, the multi-phase review process, and the importance of open-source practices. The guests present data showing increasing participation, sustained artifact availability, and varying levels of community engagement, underscoring the growing relevance of artifacts in validating and extending research.

    The discussion also highlights recurring challenges such as tight timelines between paper acceptance and camera-ready deadlines, disparities in expectations between main program and artifact committees, difficulties with specialized hardware requirements, and lack of institutional continuity among evaluators. To address these, the guests propose early artifact preparation, stronger integration across committees, formalization of evaluation guidelines, and possibly making artifact submission mandatory. They advocate for broader standardization across CS subfields and suggest introducing a “Test of Time” award for artifacts. Looking to the future, they envision a more scalable, consistent, and impactful artifact evaluation process, but caution that continued growth in paper volume will demand innovation to maintain quality and reviewer sustainability.

    Links:
    - Lessons Learned from Five Years of Artifact Evaluations at EuroSys [DOI]
    - Thaleia's Homepage
    - Anjo's Homepage
    - Miguel's Homepage

  12. 80

    Dominik Winterer | Validating SMT Solvers for Correctness and Performance via Grammar-based Enumeration | #63

    In this episode of the Disseminate podcast, Dominik Winterer discusses his research on SMT (Satisfiability Modulo Theories) solvers and his recent OOPSLA paper titled "Validating SMT Solvers for Correctness and Performance via Grammar-Based Enumeration". Dominik shares his academic journey from the University of Freiburg to ETH Zurich, and now to a lectureship at the University of Manchester. He introduces ET, a tool he developed for exhaustive grammar-based testing of SMT solvers. Unlike traditional fuzzers that use random input generation, ET systematically enumerates small, syntactically valid inputs using context-free grammars to expose bugs more effectively. This approach simplifies bug triage and has revealed over 100 bugs, many of them soundness- and performance-related, a striking number of which have already been fixed. Dominik emphasizes the tool’s surprising ability to identify deep bugs using minimal input and track solver evolution over time, highlighting ET's potential for integration into CI pipelines.

    The conversation then expands into broader reflections on formal methods and the future of software reliability. Dominik advocates for a new discipline, Formal Methods Engineering, to bridge the gap between software engineering and formal verification tools. He stresses the importance of building trustworthy verification tools since the reliability of software increasingly depends on them. Dominik also discusses adapting ET to other domains, such as JavaScript engines, and suggests that grammar-based enumeration can be applied widely to any system with a context-free grammar. Addressing the rise of AI, he envisions validation portfolios that integrate formal methods into LLM-based tooling, offering certified assessments of model outputs. He closes with a call for the community to embrace pragmatic, systematic, and scalable approaches to formal methods to ensure these tools can live up to their promises in real-world development settings.

    Links:
    - Dominik's Homepage
    - Validating SMT Solvers for Correctness and Performance via Grammar-Based Enumeration

  13. 79

    Haralampos Gavriilidis | Fast and Scalable Data Transfer across Data Systems | #62

    In this episode of Disseminate, we welcome Harry Gavriilidis back to the podcast to explore his latest research on fast and scalable data transfer across systems, soon to be presented at SIGMOD 2025. Building on his work with XDB, Harry introduces XDBC, a novel data transfer framework designed to balance performance and generalizability. We dive into the challenges of moving data across heterogeneous environments, ranging from cloud systems to IoT devices, and critique the limitations of current generic methods like JDBC and specialized point-to-point connectors.

    Harry walks us through the architecture of XDBC, which modularizes the data transfer pipeline into configurable stages like reading, serialization, compression, and networking. The episode highlights how this architecture adapts to varying performance constraints and introduces a cost-based optimizer to automate tuning for different environments. We also touch on future directions, including dynamic reconfiguration, fault tolerance, and learning-based optimizations. If you're interested in systems, performance engineering, or database interoperability, this episode is a must-listen.

  14. 78

    Haralampos Gavriilidis | SheetReader: Efficient spreadsheet parsing

    In this episode of the DuckDB in Research series, Harry Gavriilidis (PhD student at TU Berlin) joins us to discuss SheetReader, a high-performance spreadsheet parser that dramatically outpaces traditional tools in both speed and memory efficiency. By taking advantage of the standardized structure of spreadsheet files and bypassing generic XML parsers, SheetReader delivers fast and lightweight parsing, even on large files. Now available as a DuckDB extension, it enables users to query spreadsheets directly with SQL and integrate them seamlessly into broader analytical workflows.

    Harry shares insights into the development process, performance benchmarks, and the surprisingly complex world of spreadsheet parsing. He also discusses community feedback, feature requests (like detecting multiple tables or parsing colored rows), and future plans, including tighter integration with DuckDB and support for Arrow. The conversation wraps up with a look at Harry’s broader research on composable database systems and data interoperability, highlighting how tools like DuckDB are reshaping modern data analysis.

  15. 77

    Arjen P. de Vries | faiss: An extension for vector data & search

    In this episode of the DuckDB in Research series, we’re joined by Arjen de Vries, Professor of Data Science at Radboud University. Arjen dives into his team’s development of a DuckDB extension for FAISS, a library originally developed at Facebook for efficient similarity search and vector operations.

    We explore the growing importance of embeddings and dense retrieval in modern information retrieval systems, and how DuckDB’s zero-copy architecture and tight integration with the Python ecosystem make it a compelling choice for managing large-scale vector data. Arjen shares insights into the technical challenges and architectural decisions behind the extension, comparisons with DuckDB’s native VSS (vector search) solution, and the broader vision of integrating vector search more deeply into relational databases.

    Along the way, we also touch on DuckDB's extension ecosystem, its potential for future research, and why tools like this are reshaping how we build and query modern AI-enabled systems.

  16. 76

    David Justen | POLAR: Adaptive and non-invasive join order selection via plans of least resistance

    In this episode, we sit down with David Justen to discuss his work on POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance, which was implemented in DuckDB. David shares his journey in the database space, insights into performance optimization, and the challenges of working with modern analytical workloads. We dive into the intricacies of query compilation, vectorized execution, and how DuckDB is shaping the future of in-memory databases. Tune in for a deep dive into database internals, industry trends, and what’s next for high-performance data processing!

    Links:
    - VLDB 2024 Paper
    - David's Homepage

  17. 75

    Daniël ten Wolde | DuckPGQ: A graph extension supporting SQL/PGQ

    In this episode, we sit down with Daniël ten Wolde, a PhD researcher at CWI’s Database Architectures Group, to explore DuckPGQ, an extension to DuckDB that brings powerful graph querying capabilities to relational databases. Daniël shares his journey into database research, the motivations behind DuckPGQ, and how it simplifies working with graph data. We also dive into the technical challenges of implementing SQL Property Graph Queries (SQL/PGQ) in DuckDB, discuss performance benchmarks, and explore the future of DuckPGQ in graph analytics and machine learning. Tune in to learn how this cutting-edge extension is bridging the gap between research and industry!

    Links:
    - DuckPGQ homepage
    - Community extension
    - Daniël's homepage

  18. 74

    Till Döhmen | DuckDQ: A Python library for data quality checks in ML pipelines

    In this episode we kick off our DuckDB in Research series with Till Döhmen, a software engineer at MotherDuck, where he leads AI efforts. Till shares insights into DuckDQ, a Python library designed for efficient data quality validation in machine learning pipelines, leveraging DuckDB’s high-performance querying capabilities.

    We discuss the challenges of ensuring data integrity in ML workflows, the inefficiencies of existing solutions, and how DuckDQ provides a lightweight, drop-in replacement that seamlessly integrates with scikit-learn. Till also reflects on his research journey, the impact of DuckDB’s optimizations, and the future potential of data quality tooling. Plus, we explore how AI tools like ChatGPT are reshaping research and productivity. Tune in for a deep dive into the intersection of databases, machine learning, and data validation!

    Resources:
    - GitHub
    - Paper
    - Slides
    - Till's Homepage
    - datasketches extension (released by a DuckDB community member 2 weeks after we recorded!)

  19. 73

    Disseminate x DuckDB Coming Soon...

    Hey folks! We have been collaborating with everyone's favourite in-process SQL OLAP database management system, DuckDB, to bring you a new podcast series: the DuckDB in Research series!

    At Disseminate our mission is to bridge the gap between research and industry by exploring research that has a real-world impact. DuckDB embodies this synergy: decades of research underpin its design, and now it’s making waves in the research community as a platform for others to build on. That is what the series will focus on!

    Join us as we kick off the series with:
    📌 Daniël ten Wolde – DuckPGQ, a graph workload extension for DuckDB supporting SQL/PGQ
    📌 David Justen – POLAR: Adaptive, non-invasive join order selection
    📌 Till Döhmen – DuckDQ: A Python library for data quality checks in ML pipelines
    📌 Arjen de Vries – FAISS extension for vector similarity search in DuckDB
    📌 Harry Gavriilidis – SheetReader: Efficient spreadsheet parsing

    Whether you're a researcher, engineer, or just curious about the intersection of databases and innovation, we are sure you will love this series. Subscribe now and stay tuned for our first episode! 🚀

  20. 72

    High Impact in Databases with... Anastasia Ailamaki

    In this High Impact in Databases episode we talk to Anastasia Ailamaki.

    Anastasia is a Professor of Computer and Communication Sciences at the École Polytechnique Fédérale de Lausanne (EPFL). Tune in to hear Anastasia's story!

    The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

    You can find Anastasia on:
    - Homepage
    - Google Scholar
    - LinkedIn

  21. 71

    Anastasiia Kozar | Fault Tolerance Placement in the Internet of Things | #61

    In this episode, we chat with Anastasiia Kozar about her research on fault tolerance in resource-constrained environments. As IoT applications leverage sensors, edge devices, and cloud infrastructure, ensuring system reliability at the edge poses unique challenges. Unlike the cloud, edge devices operate without persistent backups or high availability standards, leading to increased vulnerability to failures. Anastasiia explains how traditional methods fall short, as they fail to align resource allocation with fault tolerance needs, often resulting in system underperformance.

    To address this, Anastasiia introduces a novel resource-aware approach that combines operator placement and fault tolerance into a unified process. By optimizing where and how data is backed up, her solution significantly improves system reliability, especially for low-end edge devices with limited resources. The result? Up to a tenfold increase in throughput compared to existing methods. Tune in to learn more!

    Links:
    - Fault Tolerance Placement in the Internet of Things [SIGMOD'24]
    - The NebulaStream Platform: Data and Application Management for the Internet of Things [CIDR'20]
    - nebula.stream

  22. 70

    Liana Patel | ACORN: Performant and Predicate-Agnostic Hybrid Search | #60

    In this episode, we chat with Liana Patel about ACORN, a groundbreaking method for hybrid search in applications using mixed-modality data. As more systems require simultaneous access to embedded images, text, video, and structured data, traditional search methods struggle to maintain efficiency and flexibility. Liana explains how ACORN, leveraging Hierarchical Navigable Small Worlds (HNSW), enables efficient, predicate-agnostic searches by introducing innovative predicate subgraph traversal. This allows ACORN to outperform existing methods significantly, supporting complex query semantics and achieving 2–1,000 times higher throughput on diverse datasets. Tune in to learn more!

    Links:
    - ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data [SIGMOD'24]
    - Liana's LinkedIn
    - Liana's X

  23. 69

    High Impact in Databases with... David Maier

    In this High Impact episode we talk to David Maier.

    David is the Maseeh Professor Emeritus of Emerging Technologies at Portland State University. Tune in to hear David's story and learn about some of his most impactful work.

    The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

    You can find David on:
    - Homepage
    - Google Scholar

  24. 68

    Raunak Shah | R2D2: Reducing Redundancy and Duplication in Data Lakes | #59

    In this episode, Raunak Shah joins us to discuss the critical issue of data redundancy in enterprise data lakes, which can lead to soaring storage and maintenance costs. Raunak highlights how large-scale data environments, ranging from terabytes to petabytes, often contain duplicate and redundant datasets that are difficult to manage. He introduces the concept of "dataset containment" and explains its significance in identifying and reducing redundancy at the table level in these massive data lakes, an area where there has been little prior work.

    Raunak then dives into the details of R2D2, a novel three-step hierarchical pipeline designed to efficiently tackle dataset containment. By utilizing schema containment graphs, statistical min-max pruning, and content-level pruning, R2D2 progressively reduces the search space to pinpoint redundant data. Raunak also discusses how the system, implemented on platforms like Azure Databricks and AWS, offers significant improvements over existing methods, processing TB-scale data lakes in just a few hours with high accuracy. He concludes with a discussion on how R2D2 optimally balances storage savings and performance by identifying datasets that can be deleted and reconstructed on demand, providing valuable insights for enterprises aiming to streamline their data management strategies.

    Materials:
    - SIGMOD'24 Paper - R2D2: Reducing Redundancy and Duplication in Data Lakes
    - ICDE'24 - Towards Optimizing Storage Costs in the Cloud

  25. 67

    High Impact in Databases with... Aditya Parameswaran

    In this High Impact episode we talk to Aditya Parameswaran about some of his most impactful work.

    Aditya is an Associate Professor at the University of California, Berkeley. Tune in to hear Aditya's story!

    The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

    Links:
    - EPIC Data Lab
    - Answering Queries using Humans, Algorithms and Databases (CIDR'11)
    - Potter’s Wheel: An Interactive Data Cleaning System (VLDB'01)
    - Online Aggregation (SIGMOD'97)
    - Polaris: A System for Query, Analysis and Visualization of Multi-dimensional Relational Databases (INFOVIS'00)
    - Coping with Rejection
    - Ponder

    You can find Aditya on:
    - Twitter
    - LinkedIn
    - Google Scholar

  26. 66

    Marco Costa | Taming Adversarial Queries with Optimal Range Filters | #58

    In this episode, we sit down with Marco Costa to discuss the fascinating world of range filters, focusing on how they help optimize queries in databases by determining whether a range intersects with a given set of keys. Marco explains how traditional range filters, like Bloom filters, often result in high false positives and slow query times, especially when dealing with adversarial inputs where queries are correlated with the keys. He walks us through the limitations of existing heuristic-based solutions and the common challenges they face in maintaining accuracy and speed under such conditions.

    The highlight of our conversation is Grafite, a novel range filter introduced by Marco and his team. Unlike previous approaches, Grafite comes with clear theoretical guarantees and offers robust performance across various datasets, query sizes, and workloads. Marco dives into the technicalities, explaining how Grafite delivers faster query times and maintains predictable false positive rates, making it the most reliable range filter in scenarios where queries are correlated with keys. Additionally, he introduces a simple heuristic filter that excels in uncorrelated queries, pushing the boundaries of current solutions in the field.

    Links:
    - SIGMOD'24 Paper - Grafite: Taming Adversarial Queries with Optimal Range Filters

  27. 65

    High Impact in Databases with... Ali Dasdan

    In this High Impact episode we talk to Ali Dasdan, CTO at Zoominfo. Tune in to hear Ali's story and learn about some of his most impactful work such as his work on "Map-Reduce-Merge". The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. Materials mentioned on this episode: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters (SIGMOD'07); The Art of Doing Science and Engineering: Learning to Learn, Richard Hamming; How to Solve It, George Polya; Systems Architecting: Creating & Building Complex Systems, Eberhardt Rechtin. You can find Ali on: Twitter; LinkedIn.

  28. 64

    Matt Perron | Analytical Workload Cost and Performance Stability With Elastic Pools | #57

    In this episode, we dive deep into the complexities of managing analytical query workloads with our guest, Matt Perron. Matt explains how the rapid and unpredictable fluctuations in resource demands present a significant challenge for provisioning. Traditional methods often lead to either over-provisioning, resulting in excessive costs, or under-provisioning, which causes poor query latency during demand spikes. However, there's a promising solution on the horizon. Matt shares insights from recent research that showcases the viability of using cloud functions to dynamically match compute supply with workload demand without the need for prior resource provisioning. While effective for low query volumes, this approach becomes cost-prohibitive as query volumes increase, highlighting the need for a more balanced strategy. Matt introduces us to a novel strategy that combines the best of both worlds: the rapid scalability of cloud functions and the cost-effectiveness of virtual machines. This innovative approach leverages the fast but expensive cloud functions alongside slow-starting yet inexpensive virtual machines to provide elasticity without sacrificing cost efficiency. He elaborates on how their implementation, called Cackle, achieves consistent performance and cost savings across a wide range of workloads and conditions. Tune in to learn how Cackle avoids the pitfalls of traditional approaches, delivering stable query performance and minimizing costs even as demand fluctuates wildly. Links: Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools [SIGMOD'24]; Matt's Homepage.

  29. 63

    High Impact in Databases with... Andreas Kipf

    In this High Impact episode we talk to Andreas Kipf about his work on "Learned Cardinalities". Andreas is the Professor of Data Systems at Technische Universität Nürnberg (UTN). Tune in to hear Andreas's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. Papers mentioned on this episode: Learned Cardinalities: Estimating Correlated Joins with Deep Learning (CIDR'19); The Case for Learned Index Structures (SIGMOD'18); Adaptive Optimization of Very Large Join Queries (SIGMOD'18). You can find Andreas on: Twitter; LinkedIn; Google Scholar; Data Systems Lab @ UTN.

  30. 62

    Marvin Wyrich & Justus Bogner | How Software Engineering Research Is Discussed on LinkedIn | #56

    In this episode, we delve into the intersection of software engineering (SE) research and professional practice with experts Marvin Wyrich and Justus Bogner. As LinkedIn stands as the largest professional network globally, it serves as a critical platform for bridging the gap between SE researchers and practitioners. Marvin and Justus explore the dynamics of how research findings are shared and discussed on LinkedIn, providing both quantitative and qualitative insights into the effectiveness of these interactions. They reveal that a significant portion of SE research posts on LinkedIn are authored by individuals outside the original research team and that a majority of comments on these posts come from industry professionals, highlighting a vibrant but underutilized avenue for science communication. Our guests shed light on the current state of this metaphorical bridge, emphasizing the potential for LinkedIn to enhance collaboration and knowledge exchange between academia and industry. Despite the promising engagement from practitioners, the discussion reveals that only half of the SE research posts receive any comments, indicating room for improvement in fostering more interactive dialogues. Marvin and Justus offer practical advice for researchers to better engage with practitioners on LinkedIn and suggest strategies for making research dissemination more impactful. This episode provides valuable insights for anyone interested in leveraging social media for advancing software engineering knowledge and practice. Links: ICSE'24 Paper; Marvin's Homepage; Justus's Homepage.

  31. 61

    High Impact in Databases with... Joe Hellerstein

    In this High Impact episode we talk to Joe Hellerstein. Joe is the Jim Gray Professor of Computer Science at UC Berkeley. Tune in to hear Joe's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

  32. 60

    Harry Goldstein | Property-Based Testing | #55

    In this episode, we chat with Harry Goldstein about Property-Based Testing (PBT). Harry shares insights from interviews with PBT users at Jane Street, highlighting PBT's strengths in testing complex code and boosting developer confidence. Harry also discusses the challenges of writing properties and generating random data, and the difficulties in assessing test effectiveness. He identifies key areas for future improvement, such as performance enhancements and better random input generation. This episode is essential for those interested in the latest developments in software testing and PBT's future. Links: ICSE'24 Paper; Harry's website; X: @hgoldstein95.

  33. 59

    High Impact in Databases with... Raghu Ramakrishnan

    In this High Impact episode we talk to Raghu Ramakrishnan. Raghu is CTO for Data and a Technical Fellow at Microsoft. Tune in to hear Raghu's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.

  34. 58

    Gina Yuan | In-Network Assistance With Sidekick Protocols | #54

    Join us as we chat with Gina Yuan about her pioneering work on sidekick protocols, designed to enhance the performance of encrypted transport protocols like QUIC and WebRTC. These protocols ensure privacy but limit in-network innovations. Gina explains how sidekick protocols allow intermediaries to assist endpoints without compromising encryption. Discover how Gina tackles the challenge of referencing opaque packets with her innovative quACK tool and learn about the real-world benefits, including improved Wi-Fi retransmissions, energy-saving proxy acknowledgments, and the PACUBIC congestion-control mechanism. This episode offers a glimpse into the future of network performance and security. Links: NSDI'2024 Paper; Gina's Homepage; Sidekick's GitHub Repo.

  35. 57

    High Impact in Databases with... Moshe Vardi

    Welcome to another episode of the High Impact series - today we talk with Moshe Vardi! Moshe is the Karen George Distinguished Service Professor in Computational Engineering at Rice University, where his research focuses on automated reasoning. Tune in to hear Moshe's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. You can find Moshe on X, LinkedIn, and Mastodon @vardi. Links to all his work can be found on his website.

  36. 56

    Tammy Sukprasert | Move Your Workloads To Sweden! | #53

    In this episode, we dip our toes into the world of sustainable computing and interview Tammy Sukprasert about her research on reducing carbon emissions in cloud computing through workload scheduling. Tammy explores the concept of shifting cloud workloads across different times and locations to coincide with low-carbon energy availability. Unlike previous studies that focused on specific regions or workloads, her comprehensive analysis uses carbon intensity data from 123 regions to assess both batch and interactive workloads. She considers various factors such as job duration, deadlines, and service level objectives (SLOs). Tammy's findings reveal that while spatiotemporal workload shifting can reduce carbon emissions, the practical upper bounds of these reductions are limited and far from ideal. Simple scheduling policies often achieve most of the potential reductions, with more complex techniques offering minimal additional benefits. Additionally, Tammy's research highlights that as the energy grid becomes greener, the benefits of carbon-aware scheduling over carbon-agnostic approaches decrease. This discussion offers crucial insights for the future of cloud computing and sustainable technology. Whether you're a tech enthusiast, environmental advocate, or cloud industry professional, Tammy's work provides valuable perspectives on the intersection of technology and sustainability. Join us to learn more about how innovative scheduling strategies can contribute to a greener cloud computing landscape. Links: Tammy's LinkedIn; On the Limitations of Carbon-Aware Temporal and Spatial Workload Shifting in the Cloud (EuroSys'24 Paper); Carbon Savings Upper Bound Analysis.

  37. 55

    High Impact in Databases with... Ryan Marcus

    Welcome to the first episode of the High Impact series! The High Impact series is inspired by the blog post “Most Influential Database Papers” by Ryan Marcus, and today we talk to Ryan! Tune in to hear Ryan's story so far. We chat about his current work before moving on to discuss his most impactful work. We also dig into what motivates him and how he handles setbacks, and get his take on current trends. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. Links: Most influential database papers; Ryan's website; Ryan's twitter/X; Bao: Making Learned Query Optimization Practical; Neo: A Learned Query Optimizer.

  38. 54

    Yazhuo Zhang | SIEVE is Simpler than LRU | #52

    In this episode, we explore the world of caching with Yazhuo Zhang, who introduces the SIEVE algorithm. Traditional eviction algorithms have long struggled with a trade-off between efficiency, throughput, and simplicity. SIEVE breaks this trade-off by offering a simpler alternative to LRU while outperforming state-of-the-art algorithms in both efficiency and scalability for web cache workloads. SIEVE has been implemented in five production cache libraries with minimal code changes and evaluated across 1559 cache traces. It achieves up to a 63.2% lower miss ratio than ARC and surpasses nine other algorithms in over 45% of cases. Its simplicity doesn't compromise scalability either: it doubles throughput compared to optimized LRU implementations. Join us as Yazhuo reveals how SIEVE is set to redefine caching efficiency, promising faster and more streamlined data serving in production systems. Links: SIEVE is Simpler than LRU: an Efficient Turn-Key Eviction Algorithm for Web Caches (NSDI'24); FIFO Queues are All You Need for Cache Eviction (SOSP'23); Yazhuo's homepage; Yazhuo's LinkedIn; Yazhuo's Twitter/X; Cachemon/SIEVE's website; S3FIFO website.

  39. 53

    Introducing the High Impact Series...

    Introducing the High Impact Series! Hey folks, we have a new series coming soon, inspired by the blog post “Most Influential Database Papers” by Ryan Marcus. The series will feature interviews with the authors of some of the most impactful work in the field of databases. We will talk about the story behind some of their most impactful work, getting them to reflect on the impact it has had over the years, as well as getting their take on the current trends in the field. Proudly sponsored by Pometry.

  40. 52

    Eleni Zapridou | Oligolithic Cross-task Optimizations across Isolated Workloads | #51

    In this episode, we talk to Eleni Zapridou and delve into the challenges of data processing within enterprises, where multiple applications operate concurrently on shared resources. Traditional resource boundaries between applications often lead to increased costs and resource consumption. However, as Eleni explains, the principle of functional isolation offers a solution by combining cross-task optimizations with performance isolation. We explore GroupShare, an innovative strategy that reduces CPU consumption and query latency, transforming data processing efficiency. Join us as we discuss the implications of functional isolation with Eleni and its potential to revolutionize enterprise data processing. Links: CIDR'24 Paper; Eleni's Twitter; Eleni's LinkedIn.

  41. 51

    Pat Helland | Scalable OLTP in the Cloud: What’s the BIG DEAL? | #50

    In this thought-provoking podcast episode, we dive into the world of scalable OLTP (OnLine Transaction Processing) systems with Pat Helland. As a seasoned expert in the field, Pat shares his insights on the critical role of isolation semantics in the scalability of OLTP systems, emphasizing its significance as the "BIG DEAL." By examining the interface between OLTP databases and applications, particularly through the lens of RCSI (READ COMMITTED SNAPSHOT ISOLATION) SQL databases, Pat talks about the limitations imposed by current database architectures and application patterns on scalability. Through a compelling thought experiment, Pat explores the asymptotic limits to scale for OLTP systems, challenging the status quo and envisioning a reimagined approach to building both databases and applications that empowers scalability while adhering to established RCSI semantics. By shedding light on how today's popular databases and common app patterns may unnecessarily hinder scalability, Pat sparks discussions within the database community, paving the way for new opportunities and advancements in OLTP systems. Join us as we delve into this conversation with Pat Helland, where every insight shared could potentially catalyze significant transformations in the realm of OLTP scalability. Papers mentioned during the episode: Scalable OLTP in the Cloud: What’s the BIG DEAL?; Autonomous Computing; Decoupled Transactions; Don't Get Stuck in the "Con" Game; The Best Place to Build a Subway; Building on Quicksand; Side effects, front and center; Immutability changes everything; Is Scalable OLTP in the Cloud a solved problem? You can find Pat on: Twitter/X; LinkedIn; Scattered Thoughts on Distributed Systems.

  42. 50

    Rui Liu | Towards Resource-adaptive Query Execution in Cloud Native Databases | #49

    In this episode, we talk to Rui Liu and explore the transformative potential of Ratchet, a groundbreaking resource-adaptive query execution framework. We delve into the challenges posed by ephemeral resources in modern cloud environments and the innovative solutions offered by Ratchet. Rui guides us through the intricacies of Ratchet's design, highlighting its ability to enable adaptive query suspension and resumption, sophisticated resource arbitration for diverse workloads, and a fine-grained pricing model to navigate fluctuating resource availability. Join us as we uncover the future of cloud-native databases and workloads, and discover how Ratchet is poised to revolutionize the way we harness the power of dynamic cloud resources. Links: CIDR'24 Paper; Rui's LinkedIn; Rui's Twitter/X; Rui's Homepage. You can find links to all Rui's work from his Google Scholar profile.

  43. 49

    Yifei Yang | Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries | #48

    In this episode, Yifei Yang introduces predicate transfer, a revolutionary method for optimizing join performance in databases. Predicate transfer builds on Bloom joins, extending their benefits to multi-table joins. Inspired by Yannakakis's theoretical insights, predicate transfer leverages Bloom filters to achieve significant speed improvements. Yifei's evaluation shows an average 3.3× performance boost over Bloom join on the TPC-H benchmark, highlighting the potential of predicate transfer to revolutionize database query optimization. Join us as we explore the transformative impact of predicate transfer on database operations. Links: CIDR'24 Paper; Yifei's LinkedIn; Buy Me A Coffee; Listener Survey.

  44. 48

    Vikramank Singh | Panda: Performance Debugging for Databases using LLM Agents | #47

    In this episode, Vikramank Singh introduces the Panda framework, aimed at refining Large Language Models' (LLMs) capability to address database performance issues. Vikramank elaborates on Panda's four components (Grounding, Verification, Affordance, and Feedback), illustrating how they collaborate to contextualize LLM responses and deliver actionable recommendations. By bridging the divide between technical knowledge and practical troubleshooting needs, Panda has the potential to revolutionize database debugging practices, offering a promising avenue for more effective and efficient resolution of performance challenges in database systems. Tune in to learn more! Links: CIDR'24 Paper; Vikramank's LinkedIn.

  45. 47

    Tamer Eldeeb | Chablis: Fast and General Transactions in Geo-Distributed Systems | #46

    In this episode, Tamer Eldeeb sheds light on the challenges faced by geo-distributed database management systems (DBMSes) in supporting strictly-serializable transactions across multiple regions. He discusses the compromises often made between low-latency regional writes and restricted programming models in existing DBMS solutions. Tamer introduces Chablis, a groundbreaking geo-distributed, multi-versioned transactional key-value store designed to overcome these limitations. Chablis offers a general interface accommodating range and point reads, along with writes within multi-step strictly-serializable ACID transactions. Leveraging advancements in low-latency datacenter networks and innovative DBMS designs, Chablis eliminates the need for compromises, ensuring fast read-write transactions with low latency within a single region, while enabling global strictly-serializable lock-free snapshot reads. Join us as we explore the transformative potential of Chablis in revolutionizing the landscape of geo-distributed DBMSes and facilitating seamless transactional operations across distributed environments. Links: CIDR'24 Chablis Paper; OSDI'23 Chardonnay paper; Tamer's LinkedIn.

  46. 46

    Matt Butrovich | Tigger: A Database Proxy That Bounces With User-Bypass | #45

    Summary: In this episode, we chat to Matt Butrovich about his research on database proxies. We discuss the inefficiencies of traditional database proxies, which operate in user-space, causing overhead due to buffer copying and system calls. Matt introduces "user-bypass", which leverages Linux's eBPF infrastructure to move application logic into kernel-space. Matt then tells us about Tigger, a PostgreSQL-compatible DBMS proxy, showcasing user-bypass benefits. Tune in to hear about the experiments that demonstrate how Tigger can achieve up to a 29% reduction in transaction latencies and a 42% reduction in CPU utilization compared to other widely-used proxies. Links: Matt's homepage; VLDB'23 paper; Tigger's GitHub repo.

  47. 45

    Gábor Szárnyas | The LDBC Social Network Benchmark: Business Intelligence Workload | #44

    Summary: In this episode, Gábor Szárnyas takes us on a journey through the LDBC Social Network Benchmark's Business Intelligence workload (SNB BI). Developed through collaboration between academia and industry, the SNB BI is a comprehensive graph OLAP benchmark. It pushes the boundaries of synthetic and scalable analytical database benchmarks, featuring a sophisticated data generator and a temporal graph with small-world phenomena. The benchmark's query workload, rooted in LDBC's innovative design methodology, aims to drive future technical advancements in graph database systems. Gábor highlights SNB BI's unique features, including the adoption of "parameter curation" for stable query runtimes across diverse parameters. Join us for a succinct yet insightful exploration of SNB BI, where Gábor Szárnyas unveils the intricacies shaping the forefront of analytical data systems and graph workloads. Links: VLDB'23 Paper; Gabor's Homepage; LDBC Homepage; LDBC GitHub.

  48. 44

    Thaleia Doudali | Is Machine Learning Necessary for Cloud Resource Usage Forecasting? | #43

    Summary: In this week's episode, we talk with Thaleia Doudali and explore the realm of cloud resource forecasting, focusing on the use of Long Short Term Memory (LSTM) neural networks, a popular machine learning model. Drawing from her research, Thaleia discusses the surprising discovery that, despite the complexity of ML models, accurate predictions often boil down to a simple shift of values by one time step. The discussion explores the nuances of time series data, encompassing resource metrics like CPU, memory, network, and disk I/O across different cloud providers and levels. Thaleia highlights the minimal variations observed in consecutive time steps, prompting a critical question: Do we really need complex machine learning models for effective forecasting? The episode concludes with Thaleia's vision for practical resource management systems, advocating for a thoughtful balance between simple solutions, such as data shifts, and the application of machine learning. Tune in as we unravel the layers of cloud resource forecasting with Thaleia Doudali. Links: SoCC'23 Paper; Thaleia's Homepage; IMDEA Software Homepage; GitHub Repo.

  49. 43

    Jinkun Geng | Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks | #42

    Summary: In this episode, Jinkun Geng talks to us about Nezha, a high-performance consensus protocol. Nezha can be deployed by cloud tenants without support from cloud providers. Nezha bridges the gap between protocols such as MultiPaxos and Raft, which can be readily deployed, and protocols such as NOPaxos and Speculative Paxos, which provide better performance but require access to technologies such as programmable switches and in-network prioritization, which cloud tenants do not have. Tune in to learn more! Links: Jinkun's Homepage; Nezha VLDB'23 Paper; Nezha GitLab Repo.

  50. 42

    Dimitris Koutsoukos | NVM: Is it Not Very Meaningful for Databases? | #41

    Summary: In this episode, Dimitris Koutsoukos talks to us about Persistent or Non-Volatile Memory (PMEM) and we answer the question: Is it Not Very Meaningful for Databases? PMEM offers expanded memory capacity and faster access to persistent storage. However, before Dimitris's work there was no comprehensive empirical analysis of existing database engines under different PMEM modes to understand how databases can benefit from the various hardware configurations. Dimitris and his colleagues analyzed multiple different engines under common benchmarks with PMEM in AppDirect mode and Memory mode - tune in to hear the findings! Links: VLDB'23 Paper; Dimitris's Homepage; Study's source code.



HOSTED BY

Jack Waudby