AI targeting and accountability debate & Apple and Google Gemini for Siri - AI News (Mar 26, 2026)

EPISODE · Mar 26, 2026 · 12 MIN


from The Automated Daily - AI News Edition · host TrendTeller

Please support this podcast by checking out our sponsors:
- Invest Like the Pros with StockMVP - https://www.stock-mvp.com/?via=ron
- Lindy is your ultimate AI assistant that proactively manages your inbox - https://try.lindy.ai/tad
- Effortless AI design for presentations, websites, and more with Gamma - https://try.gamma.app/tad

Support The Automated Daily directly: Buy me a coffee: https://buymeacoffee.com/theautomateddaily

Today's topics:
- AI targeting and accountability debate - A deadly U.S. strike in Iran reignites questions about AI in the kill chain, focusing on Project Maven, database errors, and human accountability rather than "the chatbot did it."
- Apple and Google Gemini for Siri - Apple reportedly gets deep, in-datacenter access to Google's Gemini for distillation and customization, aiming for on-device Siri upgrades with better latency and privacy, while still building in-house models.
- Claude gets more autonomous coding - Anthropic adds "auto mode" to Claude Code, reducing approval prompts while using a safety classifier to screen tool calls, highlighting the productivity vs. operational risk tradeoff in agentic coding.
- Token-efficient developer tooling trends - New tools like a Zig-based Git alternative show a rising focus on shrinking token-heavy outputs for LLM agents, cutting costs and speeding agent loops without breaking developer workflows.
- Healthcare AI transparency and FOIA - EFF sues CMS for WISeR records, pressing for transparency on AI-driven prior authorization, training data, bias protections, privacy safeguards, and incentives that could favor denials.
- Long-context efficiency with TurboQuant - Google Research's TurboQuant targets KV-cache and vector search costs using new quantization ideas, aiming to preserve long-context quality while lowering GPU memory pressure and serving costs.
- LLM confidence, calibration, and trust - Apple research suggests some base LLMs can estimate semantic-correctness confidence, but instruction-tuning and chain-of-thought can degrade calibration, which matters for reliable uncertainty signals.
- Voice agent evaluation: accuracy vs UX - ServiceNow's EVA evaluates voice agents end-to-end with audio simulations, measuring both task success and conversation experience, and shows accuracy often rises as user experience worsens.
- OpenAI shopping push and mega-funding - OpenAI expands ChatGPT shopping discovery with richer comparisons and merchant feeds, while also adding $10B to an already massive raise, signaling both platform ambition and capital intensity.
- Agent-era app stores and discovery power - A new argument says AI agents will shift value from app downloads to APIs, making discovery and ranking power the real battleground, more like search economics than an App Store gate.
- RLVR insights for better reasoning - Alibaba's Qwen team claims RLVR changes matter most in direction, not just magnitude, using signed Δlogp to identify reasoning-critical tokens and improve reasoning at test time.
- How people actually use Claude in 2026 - Anthropic's Economic Index finds Claude usage diversifying into everyday tasks, with learning-by-doing effects and persistent geographic inequality, suggesting productivity gains may concentrate among early adopters.
- Harness engineering for autonomous apps - Anthropic describes multi-agent "harness" patterns, separating generator and evaluator, to reduce self-congratulation and improve long-run autonomous app building and QA.
In today's headlines:
- Report: Apple Can Distill Google's Gemini to Build On-Device Siri Models
- Anthropic adds 'auto mode' permissions to Claude Code for longer, safer autonomous runs
- Zig-Based "nit" Replaces Git Output for AI Agents, Cutting Tokens and Improving Speed
- EFF Sues CMS for Records on Medicare WISeR AI Prior-Authorization Pilot
- Framer launches startup program to speed website launches without developers
- Google Research unveils TurboQuant to compress LLM KV caches and speed vector search
- Guide Catalogs Anthropic Claude's Rapid 2026 Feature Rollout, From 1M-Token Context to Desktop Agents
- Judge Questions Pentagon Ban on Anthropic as Possible Retaliation
- Temporal Announces Replay 2026 Durable Execution Conference in San Francisco
- Study: Base LLMs Can Be Semantically Calibrated, but RL Tuning and Chain-of-Thought Can Break It
- ServiceNow Releases EVA, a Joint Accuracy-and-Experience Benchmark for Voice Agents
- After Iran school strike, focus on chatbots obscures Palantir's role in automated targeting
- OpenAI Expands ChatGPT Shopping with Visual Product Discovery and ACP Merchant Integrations
- Databricks Launches Lakewatch, an Open Agentic SIEM, and Announces Security-Focused Acquisitions
- Anyscale's Ray Data LLM targets 2x higher batch inference throughput than synchronous vLLM
- OpenAI adds $10B to funding round, topping $120B as it readies for possible IPO
- Directional Δlogp Analysis Shows RLVR Reasoning Gains Come From Sparse Updates to Rare Tokens
- Ossature launches an open-source harness for spec-driven LLM code generation
- AI Agents and MCP Could Unbundle the App Store Into Open Connection, Competitive Payments, and a Discovery War
- Anthropic report finds AI learning curves and widening differences in Claude adoption
- Optio open-sources an AI agent orchestrator that ships tasks to merged pull requests
- Anthropic details multi-agent harnesses for long-running app building and QA
- Crusoe Launches Managed Inference Service Powered by MemoryAlloy KV Cache

Episode Transcript

AI targeting and accountability debate

We'll start with the most sobering story on the list: reporting on the February strike in Minab, Iran, where a primary school was hit during Operation Epic Fury, killing roughly 175 to 180 people, mostly young girls. A lot of public attention zoomed in on whether Anthropic's Claude "picked" the target, but the deeper critique is about process, not personality. The piece argues this was about kill-chain compression: Project Maven, now embedded in a broader Palantir-built targeting infrastructure, can fuse intel, generate target packages, and move from detection to action faster than older workflows. That speed also means a bureaucratic mistake, like a facility mislabeled in a database and never corrected after it became a school, becomes instantly lethal. The takeaway isn't that AI replaces responsibility; it's that automation can amplify the consequences of stale data, weak oversight, and human decisions made in the name of tempo.

Judge questions Pentagon ban on Anthropic

In a related accountability thread, this time in court, a federal judge in Northern California suggested the U.S. government's ban on Anthropic may look retaliatory and potentially unconstitutional. Judge Rita Lin indicated the Pentagon's move appeared aimed at crippling the company after Anthropic spoke publicly about a contracting dispute, raising First Amendment concerns. This case matters beyond a single vendor: it could shape how national-security authorities can pressure AI suppliers, and whether speaking up about government contracting carries a chilling effect across the industry.

Apple and Google Gemini for Siri

Now to Apple's AI strategy, which keeps looking more like a two-track race. According to The Information, Apple has been granted "complete access" to Google's Gemini model inside Google's own data centers.
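For readers who want the mechanics: distillation of this kind means training a small "student" model to match a larger "teacher's" output distribution. Here is a minimal toy sketch of that idea with a linear classifier standing in for both models; nothing here reflects Apple's or Google's actual pipeline, and all names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# "Teacher": a fixed linear scorer standing in for a large model.
X = rng.normal(size=(512, 8))            # inputs
W_teacher = rng.normal(size=(8, 3))      # 3-way classification
T = 2.0                                  # distillation temperature
soft_targets = softmax(X @ W_teacher / T)

# "Student": trained only to match the teacher's soft targets via
# cross-entropy gradient descent (the 1/T factor is folded into lr).
W_student = np.zeros((8, 3))
lr = 1.0
for _ in range(1000):
    p = softmax(X @ W_student / T)
    grad = X.T @ (p - soft_targets) / len(X)
    W_student -= lr * grad

# How often the student's hard predictions match the teacher's.
agree = ((X @ W_student).argmax(1) == (X @ W_teacher).argmax(1)).mean()
print(f"student/teacher agreement: {agree:.2%}")
```

The point of the temperature is that soft targets carry more signal per example than hard labels, which is part of why a capable teacher can train a much cheaper student.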
The key point isn't that Apple wants to ship Gemini as-is; it's that this level of access reportedly enables distillation. In plain terms: Apple can use a very capable model to generate strong answers and reasoning traces, then train smaller models that are cheaper, faster, and tuned for specific tasks, ideally able to run directly on-device without a network connection. That's a big deal for latency, reliability, and privacy, especially if Apple wants Siri to feel instant and dependable. The report also suggests Apple can tune Gemini's behavior to better fit Apple's product constraints, though Gemini's current "personality" is said to be optimized for chatbot and coding patterns, which may not map perfectly to Siri. The partnership is expected to support a more conversational Siri in iOS 27, while Apple continues building its own foundation models so it isn't permanently dependent on Google.

LLM confidence, calibration, and trust

Staying with Apple, there's also a research note worth paying attention to: Apple researchers report that some base, pre-instruction-tuned LLMs can provide meaningful confidence estimates about whether an answer is semantically correct, even though these models are trained mainly to predict the next token. They introduce a framework around "semantic calibration," and the practical warning is just as important as the promise: instruction-tuning with reinforcement learning, and even chain-of-thought prompting, can degrade that calibration. If you've been hoping that "model confidence" can become a reliable safety signal, this work is a reminder that common post-training techniques may quietly break the very uncertainty cues we'd like to depend on.

Claude gets more autonomous coding

On the developer tooling front, Anthropic introduced "auto mode" in Claude Code, a new permissions setting that reduces the constant "approve this command" friction in longer coding sessions.
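The shape of that auto-mode pattern, routine calls executing without a prompt while a screening step reviews every tool call first, can be sketched in a few lines. The classifier here is a crude rule list and every name is invented for illustration; Anthropic's actual safeguard is a learned model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    args: dict

# Hypothetical stand-in for a learned safety classifier: a simple
# rule set that flags obviously destructive shell patterns.
RISKY_MARKERS = ("rm -rf", "curl | sh", "sudo ", "DROP TABLE")

def classify(call: ToolCall) -> str:
    text = f"{call.name} {call.args}"
    return "block" if any(m in text for m in RISKY_MARKERS) else "allow"

def run_with_guard(call: ToolCall, execute: Callable[[ToolCall], str]) -> str:
    # Auto mode: no user prompt for routine calls, but every tool
    # call still passes through the screening step before execution.
    if classify(call) == "block":
        return f"[blocked] {call.name}: escalate to user approval"
    return execute(call)

def fake_shell(call: ToolCall) -> str:
    return f"[ran] {call.args['cmd']}"

print(run_with_guard(ToolCall("bash", {"cmd": "git status"}), fake_shell))
print(run_with_guard(ToolCall("bash", {"cmd": "rm -rf /tmp/x"}), fake_shell))
```

The design choice worth noticing is that the guard sits between the model and the system, so a miss by the classifier, not the model's own judgment, is the failure mode you have to budget for.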
Instead of asking for user approval every time it touches files or runs a shell command, Claude can make routine permission decisions, while a safeguard classifier reviews each tool call before it executes. The intent is to make coding agents more autonomous without going fully hands-off via the more dangerous "skip approvals" approaches. Anthropic is upfront about the tradeoffs: extra checks can add latency and overhead, classifiers can miss edge cases, and sometimes they'll block benign work. But directionally, this is a sign of where coding agents are headed: fewer interruptions, more continuous execution, and more emphasis on guardrails that sit between the model and the system.

Token-efficient developer tooling trends

That theme of optimizing the whole agent loop, not just the model, also shows up in an open-source project called "nit," a Git replacement written in Zig. The pitch is simple: Git output was designed for humans scanning terminals, but AI agents often pay for every token they read. The developer analyzed real sessions and argues that shrinking default output can cut token usage and speed up workflows, especially for repetitive commands like status and log. The larger trend here is subtle but important: as AI-assisted development scales, we're going to see more "machine-first" interfaces, tools that still behave like familiar developer utilities but speak in a more compact, agent-friendly way to reduce cost and latency.

Ossature: a spec-driven harness for code generation

Another open-source angle is "Ossature," a spec-driven harness meant to keep LLM-generated software coherent across multiple modules. The project's premise is that the hard part of AI code generation isn't producing one file; it's maintaining consistency across interfaces, behavior, and dependencies over time. Ossature leans on structured specs, ambiguity checks, and build plans to keep generation grounded and verifiable.
Whether this particular tool wins mindshare or not, it highlights a broader shift: the most valuable work in AI coding is increasingly orchestration, how we constrain, evaluate, and iterate, not just raw generation.

Voice agent evaluation: accuracy vs UX

On the evaluation side, ServiceNow researchers introduced EVA, a framework for measuring conversational voice agents across full phone-style dialogues. EVA produces two headline scores, one for task accuracy and one for user experience, because in voice, users can't skim, can't reread, and small timing or transcription errors can wreck the interaction. Their benchmarking across many systems found a consistent tension: agents that complete tasks reliably often do worse on conversational experience, and no system dominates on both. The significance is that voice agents are becoming integrated systems of tools, policies, audio, and dialogue management, and we're finally getting benchmarks that treat them that way, rather than grading a single model response in isolation.

Healthcare AI transparency and FOIA

In healthcare, the Electronic Frontier Foundation filed a FOIA lawsuit against the Centers for Medicare & Medicaid Services seeking records related to WISeR, a multi-state Medicare pilot using AI to assess prior-authorization requests. EFF's concern is familiar but high-stakes: automated decision-making can create delays or denials, and without transparency it's hard to know what data the system learned from, what bias protections exist, or how errors are monitored. The report also flags incentives that could be troubling, such as vendors potentially paid based on the amount of care they deny. Regardless of where you land politically, the "why it matters" is straightforward: when AI systems influence medical coverage decisions at scale, the public needs visibility into testing, auditing, and accountability mechanisms.
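Back on EVA for a moment: the value of reporting task success and experience as two separate numbers can be shown with a toy evaluation over simulated dialogues. The dialogue records and weights below are invented for illustration and are not EVA's actual metrics.

```python
# Each simulated dialogue records whether the task completed plus a few
# experience signals a voice benchmark might track.
dialogues = [
    {"task_done": True,  "avg_response_delay_s": 2.8, "interruptions": 3},
    {"task_done": True,  "avg_response_delay_s": 2.5, "interruptions": 2},
    {"task_done": False, "avg_response_delay_s": 0.6, "interruptions": 0},
    {"task_done": True,  "avg_response_delay_s": 3.1, "interruptions": 4},
]

def accuracy_score(ds):
    return sum(d["task_done"] for d in ds) / len(ds)

def experience_score(ds):
    # Penalize slow turns and interruptions; clamp each score to [0, 1].
    def one(d):
        penalty = 0.15 * d["avg_response_delay_s"] + 0.1 * d["interruptions"]
        return max(0.0, 1.0 - penalty)
    return sum(one(d) for d in ds) / len(ds)

acc, ux = accuracy_score(dialogues), experience_score(dialogues)
print(f"task accuracy: {acc:.2f}, experience: {ux:.2f}")
```

In this toy run the agent completes most tasks while the experience score stays low, exactly the kind of tension a single blended number would hide.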
Long-context efficiency with TurboQuant

From Google Research, TurboQuant is a new set of quantization techniques aimed at compressing the high-dimensional vectors used in two places that get very expensive: LLM KV caches for long context, and vector indexes for semantic search. The headline isn't the math; it's the bottleneck: memory. Long-context systems can become constrained by how much they must store to keep a conversation or a document in working memory. If compression can lower memory use without degrading output quality, it changes the economics of serving long-context LLMs and running large-scale retrieval. In practice, work like this can be as impactful as a model upgrade, because it targets the cost and throughput limits that determine whether advanced features are usable outside demos.

OpenAI shopping push and mega-funding

OpenAI is pushing ChatGPT further into shopping. The update adds more visual discovery, with product grids, comparisons, and image-based matching, while leaning into merchant feeds through an expanded Agentic Commerce Protocol. OpenAI is also stepping back from its earlier Instant Checkout approach and letting merchants keep their own checkout flows, which suggests the company is prioritizing being the starting point for discovery rather than owning the full transaction. Walmart is also launching an in-ChatGPT app experience that moves users into a Walmart environment with account linking and payments. The platform implication is big: if chat becomes the front door for shopping research, whoever controls ranking and presentation will influence demand in a way that starts to resemble search, only with even fewer clicks between suggestion and purchase.

That push comes alongside a staggering funding update: OpenAI's CFO said the company secured an additional $10 billion, pushing the round to over $120 billion, with investors ranging from venture firms to mutual funds and sovereign capital.
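A quick illustration of the memory economics behind the TurboQuant story: even the crudest baseline, per-row symmetric int8 quantization of a KV-cache-shaped tensor, shrinks storage several-fold at a small reconstruction error. TurboQuant's actual techniques are more sophisticated; this sketch only shows the bottleneck being targeted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for one KV-cache block: 4096 cached positions x 128 dims, fp32.
kv = rng.normal(size=(4096, 128)).astype(np.float32)

# Per-row symmetric int8 quantization: one scale per row.
scale = np.abs(kv).max(axis=1, keepdims=True) / 127.0
q = np.round(kv / scale).astype(np.int8)

# Dequantize and measure how much signal survives.
dequant = q.astype(np.float32) * scale
rel_err = np.linalg.norm(kv - dequant) / np.linalg.norm(kv)

fp32_bytes = kv.nbytes
int8_bytes = q.nbytes + scale.nbytes
print(f"memory: {fp32_bytes} -> {int8_bytes} bytes "
      f"({fp32_bytes / int8_bytes:.1f}x smaller), relative error {rel_err:.3f}")
```

Scaling that saving across every attention layer and every concurrent long-context session is what changes serving costs.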
OpenAI also signaled it's preparing for the possibility of going public, while acknowledging compute constraints and tough prioritization, reportedly including shutting down its short-form video app, Sora. The broader meaning here is that frontier AI is now a capital structure story as much as a research story: model capability is tied to infrastructure scale, and infrastructure scale is tied to fundraising on a historic level.

Agent-era app stores and discovery power

Zooming out, there's an argument gaining traction that the classic App Store model will be disrupted by AI agents that complete tasks by calling APIs instead of downloading apps. In that view, the value chain splits into connection, discovery, and payment, where connection becomes commoditized by open standards and discovery becomes the true choke point, because agents will choose services on a user's behalf. If that's right, ranking power becomes the new gatekeeper, with monetization that looks less like a 30% platform fee and more like an auction for attention, except the conversion is nearly guaranteed because the agent is acting. It's a useful lens for thinking about the next platform fight: not "who has the best app," but "who controls the recommendations an agent trusts."

RLVR insights for better reasoning

On the research side of reasoning, Alibaba's Qwen team says we've been measuring Reinforcement Learning with Verifiable Rewards, RLVR, in a slightly misleading way. Instead of looking only at how much token probabilities change after RLVR, they argue the direction of change matters, and they propose using signed token-level differences to identify which tokens are truly reasoning-critical. Their experiments suggest a small subset of tokens carries a disproportionate load, and amplifying the model along that learned direction at test time can improve reasoning without new training.
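The core measurement, a signed per-token log-probability difference between the tuned and base model, is easy to sketch. The log-prob values and token labels below are invented for illustration; the real analysis operates on actual model outputs.

```python
import numpy as np

# Toy per-token log-probs for one sampled completion, scored under a
# base model and under an RLVR-tuned model (values invented).
tokens    = ["First", ",", "factor", "the", "quadratic", ":", "x", "=", "2"]
logp_base = np.array([-1.2, -0.3, -4.1, -0.5, -3.8, -0.4, -2.0, -0.6, -2.5])
logp_rlvr = np.array([-1.1, -0.3, -1.9, -0.5, -2.0, -0.4, -1.8, -0.6, -0.9])

# Signed delta: positive means RLVR made the token MORE likely.
delta = logp_rlvr - logp_base

# Ranking by the signed value (not |delta|) separates tokens the update
# promoted from tokens it suppressed; the claim is that a sparse set of
# promoted tokens carries most of the reasoning gain.
order = np.argsort(delta)[::-1]
for i in order[:3]:
    print(f"{tokens[i]!r}: dlogp = {delta[i]:+.2f}")
```

In this toy example the content-bearing tokens ("factor", "quadratic", the final answer) dominate the ranking while filler tokens sit near zero, which is the pattern the signed analysis is designed to surface.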
The practical takeaway: as "reasoning" becomes a product feature, teams are hunting for levers that improve accuracy cheaply, test-time techniques and diagnostics that can squeeze more out of a trained model.

How people actually use Claude in 2026

Finally, Anthropic put out two pieces that together sketch where agents are actually going in real usage. First, its Economic Index, analyzing about a million Claude conversations, finds consumer use broadening into everyday tasks while some coding shifts toward API-based automation. It also highlights learning curves: longer-tenure users tend to get higher success rates and apply Claude to more work-related tasks, suggesting "learning-by-doing" could widen productivity gaps between early adopters and everyone else. Second, Anthropic described new harness designs for autonomous app building, separating a generator agent from an evaluator agent to reduce the model's tendency to rubber-stamp its own work. The message is that autonomy isn't just a model problem; it's a systems design problem: how you plan, how you critique, and how you verify over multi-hour runs.

Subscribe to edition-specific feeds:
- Space news: Apple Podcast (English), Spotify (English), RSS (English, Spanish, French)
- Top news: Apple Podcast (English, Spanish, French), Spotify (English, Spanish, French), RSS (English, Spanish, French)
- Tech news: Apple Podcast (English, Spanish, French), Spotify (English, Spanish, French), RSS (English, Spanish, French)
- Hacker news: Apple Podcast (English, Spanish, French), Spotify (English, Spanish, French), RSS (English, Spanish, French)
- AI news: Apple Podcast (English, Spanish, French), Spotify (English, Spanish, French), RSS (English, Spanish, French)

Visit our website at https://theautomateddaily.com/
Send feedback to [email protected]
Youtube · LinkedIn · X (Twitter)
