PODCAST · technology

AI Based Paper Discussions

by Sigurd

An RSS feed of papers of interest in AI Safety, Control, and Agent Behavior

15

Anthropic Mythos

Apr 8, 2026

28m
14

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Mar 12, 2026

47m
13

Public Finance in the Age of AI

Mar 8, 2026

1h 06m
12

Neural Steering Vectors Reveal Dose and Exposure-Dependent Impacts of Human-AI Relationships

Mar 8, 2026

1h 12m
11

The Generative AI Paradox

Mar 8, 2026

1h 12m
10

Training Agents to Self-Report Misbehavior

Mar 8, 2026

17m
9

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Mar 8, 2026

14m
8

The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems

Mar 8, 2026

15m
7

The Generative AI Paradox: How Synthetic Realities Erode Shared Epistemic Ground

Mar 8, 2026

16m
6

Evaluating Frontier Models for Stealth and Situational Awareness

Mar 8, 2026

16m
5

Evaluating Frontier Models for Stealth and Situational Awareness

Mar 8, 2026

18m
4

RASP Discovering Interpretable Algorithms by Decompiling Transformers into Human-Readable Programs

Discovering Interpretable Algorithms by Decompiling Transformers into Human-Readable Programs

Mar 8, 2026

14m
3

Reducing Harmful Generative AI Outputs via Consensus Sampling

Mar 8, 2026

12m
2

AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers

Mar 8, 2026

14m
1

Bridging Skill Gaps for the Future: New Jobs Creation in the AI Age

The demand and supply of new skills, especially in IT and AI, are reshaping labor markets, impacting wages and hiring. About 1 in 10 job vacancies in advanced economies demands at least one new skill, often appearing first in the United States. The incidence is about half of that in emerging market economies. These skills boost average wages and employment but deepen polarization, mostly benefitting high-skilled and (through higher consumption of services) low-skilled workers, and potentially contributing to the shrinking of the middle class. Vacancies demanding AI skills post higher wages, but the diffusion of such skills is linked to lower employment in occupations with high exposure and low complementarity to AI, posing challenges for the youth. A Skill Imbalance Index reveals wide cross-country differences. Economies facing strong demand should prioritize education and reskilling, while those facing strong supply should foster firms' absorption through innovation and access to credit.

Mar 7, 2026

21m
0

Reasoning Models Struggle to Control their Chains of Thought

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability, CoT controllability, we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word “chromosome”. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.

Mar 7, 2026

30m


