EPISODE · Feb 25, 2026 · 14 MIN
LLMs battle in RTS code & Benchmarks: SWE-bench credibility crisis - AI News (Feb 25, 2026)
from The Automated Daily - AI News Edition · host TrendTeller
Today's topics:

LLMs battle in RTS code - LLM Skirmish pits models in 1v1 RTS matches using Screeps-style code, tracking Elo, win rates, and in-tournament adaptation as a practical in-context learning benchmark.
Benchmarks: SWE-bench credibility crisis - OpenAI says SWE-bench Verified is no longer reliable due to flawed tests and training contamination, urging a shift to SWE-bench Pro and new private, holistic evaluations.
Efficient reasoning: stop thinking - A Beihang/ByteDance paper proposes SAGE and SAGE-RL to cut redundant chain-of-thought, using end-of-thinking signals to reduce tokens ~44% while improving math accuracy.
Long-horizon agentic coding - OpenAI’s cookbook stress test shows GPT-5.3-Codex running ~25 hours, consuming ~13M tokens, and building a large design tool with “durable project memory” files and guardrails.
Distillation attacks on Claude - Anthropic reports industrial-scale illicit distillation by DeepSeek, Moonshot, and MiniMax via thousands of fraudulent accounts, targeting tool use, coding, and reasoning traces.
DeepSeek V4 hype signals - Community chatter around DeepSeek V4 mixes real research (Engram memory split, sparse attention) with shaky leaks on benchmarks and pricing; the key question is real-world reliability.
AI in browsers and pricing - Perplexity’s Comet explores MCP-based local connectors (including Apple Messages) and a “Usage and Credits” page, while OpenAI is reportedly testing a $100 ChatGPT Pro Lite tier.
Enterprise alliances and labor shifts - OpenAI forms ‘Frontier Alliances’ with major consultancies to deploy agents in enterprises, as the Fed warns AI may raise near-term unemployment and complicate rate policy.
New chips and EUV advances - Taalas claims a ‘model-on-silicon’ card hardwiring Llama 3.1 8B at ~17k tok/s per user, while ASML boosts EUV source power toward higher wafer throughput by 2030.
Open-source tools for agents - Cloudflare’s AI-assisted vinext reimplements much of the Next.js API on Vite for Workers, alongside new OSS utilities like AWS Strands Labs, WorkOS CLI, and MachineAuth for M2M OAuth.

https://llmskirmish.com/
https://www.testingcatalog.com/perplexity-tests-messages-integration-and-usage-credits/?utm_source=tldrai
https://www.cnbc.com/2026/02/23/open-ai-consulting-accenture-boston-capgemini-mckinsey-frontier.html?utm_source=tldrai
https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/?utm_source=tldrai
https://blog.kilo.ai/p/deepseek-v4-rumors-vs-reality-for?utm_source=tldrai
https://developers.openai.com/cookbook/examples/codex/long_horizon_tasks?utm_source=tldrai
https://www.testingcatalog.com/openai-prepares-new-chatgpt-pro-lite-tier-priced-at-100-monthly/?utm_source=tldrai
https://theaieconomy.substack.com/p/strands-labs-developer-sandbox-autonomous-ai?utm_source=tldrai
https://www.reuters.com/business/feds-cook-says-ai-triggering-big-changes-sees-possible-short-term-unemployment-2026-02-24/
https://kaitchup.substack.com/p/taalas-hc1-absurdly-fast-per-user?utm_source=tldrai
https://www.theguardian.com/technology/2026/feb/24/feedback-loop-no-brake-how-ai-doomsday-report-rattled-markets
https://github.com/workos/workos-cli?utm_source=tldrai&utm_medium=newsletter&utm_campaign=q12026
https://si.inc/posts/fdm1/?utm_source=tldrai
https://blog.cloudflare.com/vinext/
https://links.tldrnewsletter.com/c00Xxl
https://serpapi.com/?utm_source=tldr_ai_newsletter
https://hzx122.github.io/sage-rl/?utm_source=tldrai
https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks?utm_source=tldrai
https://links.tldrnewsletter.com/a0ih4T
https://www.newelectronics.co.uk/content/news/asml-announces-breakthrough-in-euv-light-source-to-boost-chip-output?utm_source=tldrai
https://github.com/mandarwagh9/MachineAuth?utm_source=tldrai
https://www.theregister.com/2026/02/23/ibm_share_dive_anthropic_cobol/?utm_source=tldrai