R vs T‑SQL Performance: Compute Context, Batch Size, Parallel Queries & How To Fix Slow R‑SQL Pipelines episode artwork

EPISODE · Oct 2, 2025 · 19 MIN

R vs T‑SQL Performance: Compute Context, Batch Size, Parallel Queries & How To Fix Slow R‑SQL Pipelines

from M365.FM - Modern work, security, and productivity with Microsoft 365 · host Mirko Peters - Founder of m365.fm, m365.show and m365con.net

Here’s the story behind that one button: a data science team trained a model, everything worked fine—until the dataset quietly doubled, and their R pipeline started crawling for hours. The problem wasn’t the algorithm, it was compute context: they were running in local compute, dragging every row out of SQL Server and across the network into laptop memory instead of pushing the script to run where the data lives. One switch to SQL compute context flipped the execution back into the server, kept data in place, and turned the crawl into a sprint—showing why “large data = SQL compute” is the rule of thumb for serious workloads.THE INVISIBLE BOTTLENECKMost teams blame slow pipelines on “bad code” or “complex models,” but the real drag often hides in an invisible bottleneck: where the compute actually happens. In local compute context, every row has to squeeze through your network and laptop RAM, so small test sets feel fine while real production data melts the clock. In this episode, we unpack how switching to SQL Server compute context keeps processing beside the data, why ETL into SQL is the prerequisite for real gains, and how to use a simple three‑step checklist (compute context, query shape, batch size) to find the true bottleneck before you waste weeks “optimizing” the wrong thing.BATCH SIZE: POTION OF SPEED OR SLOWNESSOnce compute context is right, the next lever is batch size—your \rowsPerRead\\ setting—which behaves like a potion: dose it correctly and everything flies, misjudge it and performance staggers. We walk through how the default 50,000 rows can starve R when you scale to millions of rows, why wide tables and blob-heavy schemas demand smaller batches, and how to step-test from 50,000 to 500,000 to one million rows while watching runtime and memory usage. You’ll learn a practical tuning strategy that turns your pipeline from “constant waiting for the next chunk” into a steady flow where R stays busy without pushing SQL Server into paging.THE QUERY THAT UNLOCKS PARALLEL WORLDSThe final performance unlock is query shape and parallelism: whether your SQL statement gives the optimizer enough structure to split work across multiple paths, or quietly forces everything through a single serial lane. Instead of blindly passing \table=\\ into \RxSqlServerData\\, we show how using \sqlQuery=\\ with a lean SELECT (no “SELECT *”, no junk columns R can’t handle) unlocks parallel plans, reduces memory waste, and cuts wall‑clock time without touching your R script. You’ll also hear how to use \@parallel = 1\\ in \sp_execute_external_script\\ or \numTasks\\ in RevoScaleR, why MAXDOP and resource governance still rule the final worker count, and how to validate your plan in Management Studio before you ever run the job through R.WHAT YOU’LL LEARNWhen to choose local vs SQL Server compute context for R‑SQL pipelines.How compute context and data locality impact network I/O, memory, and runtime.How to tune \rowsPerRead\\ (batch size) for different table shapes and workloads.Why “SELECT *” kills performance and how to design lean, parallel‑friendly queries.How to use \RxSqlServerData\\, \sp_execute_external_script\\, MAXDOP, and \numTasks\\ together for parallel execution.A repeatable checklist to troubleshoot slow R + SQL Server pipelines without guesswork.THE CORE INSIGHTThe core insight of this episode is that most “slow model” problems are really “wrong execution plan” problems: you’re paying a hidden tax on every row you drag across the network instead of running code where the data already lives. Once you deliberately set compute context, batch size, and query shape, performance tuning stops being mystical and turns into a small set of levers you can test and measure. That’s the shift: from tweaking algorithms in the dark to designing pipelines where the database, the network, and R all pull in the same direction.WHO THIS EPISODE IS FORData scientists and ML engineers running R or Python against SQL Server.Data platform and database engineers responsible for performance in mixed SQL + R environments.Analytics leads and BI developers who need to scale from proof‑of‑concept to production workloads.Consultants and architects designing high‑throughput analytics pipelines on Microsoft data platforms.ABOUT THE AUTHOR / HOSTMirko Peters is a Microsoft 365 and data platform consultant and host of the M365.FM podcast, where he helps teams treat their cloud stack as an integrated operating system instead of a collection of disconnected tools. He works with organizations that run on Microsoft 365, Azure, and SQL Server to design architectures and pipelines that actually scale—translating theory about compute context, parallelism, and data locality into patterns teams can apply on real workloads.Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-fm-modern-work-security-and-productivity-with-microsoft-365--6704921/support.

Here’s the story behind that one button: a data science team trained a model, everything worked fine—until the dataset quietly doubled, and their R pipeline started crawling for hours. The problem wasn’t the algorithm, it was compute context: they were running in local compute, dragging every row out of SQL Server and across the network into laptop memory instead of pushing the script to run where the data lives. One switch to SQL compute context flipped the execution back into the server, kept data in place, and turned the crawl into a sprint—showing why “large data = SQL compute” is the rule of thumb for serious workloads.THE INVISIBLE BOTTLENECKMost teams blame slow pipelines on “bad code” or “complex models,” but the real drag often hides in an invisible bottleneck: where the compute actually happens. In local compute context, every row has to squeeze through your network and laptop RAM, so small test sets feel fine while real production data melts the clock. In this episode, we unpack how switching to SQL Server compute context keeps processing beside the data, why ETL into SQL is the prerequisite for real gains, and how to use a simple three‑step checklist (compute context, query shape, batch size) to find the true bottleneck before you waste weeks “optimizing” the wrong thing.BATCH SIZE: POTION OF SPEED OR SLOWNESSOnce compute context is right, the next lever is batch size—your \rowsPerRead\\ setting—which behaves like a potion: dose it correctly and everything flies, misjudge it and performance staggers. We walk through how the default 50,000 rows can starve R when you scale to millions of rows, why wide tables and blob-heavy schemas demand smaller batches, and how to step-test from 50,000 to 500,000 to one million rows while watching runtime and memory usage. You’ll learn a practical tuning strategy that turns your pipeline from “constant waiting for the next chunk” into a steady flow where R stays busy without pushing SQL Server into paging.THE QUERY THAT UNLOCKS PARALLEL WORLDSThe final performance unlock is query shape and parallelism: whether your SQL statement gives the optimizer enough structure to split work across multiple paths, or quietly forces everything through a single serial lane. Instead of blindly passing \table=\\ into \RxSqlServerData\\, we show how using \sqlQuery=\\ with a lean SELECT (no “SELECT *”, no junk columns R can’t handle) unlocks parallel plans, reduces memory waste, and cuts wall‑clock time without touching your R script. You’ll also hear how to use \@parallel = 1\\ in \sp_execute_external_script\\ or \numTasks\\ in RevoScaleR, why MAXDOP and resource governance still rule the final worker count, and how to validate your plan in Management Studio before you ever run the job through R.WHAT YOU’LL LEARNWhen to choose local vs SQL Server compute context for R‑SQL pipelines.How compute context and data locality impact network I/O, memory, and runtime.<a...

NOW PLAYING

R vs T‑SQL Performance: Compute Context, Batch Size, Parallel Queries & How To Fix Slow R‑SQL Pipelines

0:00 19:30

No transcript for this episode yet

We transcribe on demand. Request one and we'll notify you when it's ready — usually under 10 minutes.

Frequently Asked Questions

How long is this episode of M365.FM - Modern work, security, and productivity with Microsoft 365?

This episode is 19 minutes long.

When was this M365.FM - Modern work, security, and productivity with Microsoft 365 episode published?

This episode was published on October 2, 2025.

What is this episode about?

Here’s the story behind that one button: a data science team trained a model, everything worked fine—until the dataset quietly doubled, and their R pipeline started crawling for hours. The problem wasn’t the algorithm, it was compute context: they...

Is there a transcript available for this episode?

Yes, a full transcript is available for this episode. You can read the complete transcript on the episode page.

Can I download this M365.FM - Modern work, security, and productivity with Microsoft 365 episode?

Yes, you can download this episode by clicking the download button on the episode player, or subscribe to the podcast in your preferred podcast app for automatic downloads.
URL copied to clipboard!