PODCAST · technology
AI Based Paper Discussions
by Sigurd
An RSS feed of papers that are of interest in AI Safety Control and Agent Behavior
-
15
Anthropic Mythos
-
14
Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
-
13
Public Finance in the Age of AI
-
12
Neural Steering Vectors Reveal Dose and Exposure-Dependent Impacts of Human-AI Relationships
-
11
The Generative AI Paradox
-
10
Training Agents to Self-Report Misbehavior
-
9
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
-
8
The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems
-
7
The Generative AI Paradox: How Synthetic Realities Erode Shared Epistemic Ground
-
6
Evaluating Frontier Models for Stealth and Situational Awareness
-
5
Evaluating Frontier Models for Stealth and Situational Awareness
-
4
RASP: Discovering Interpretable Algorithms by Decompiling Transformers into Human-Readable Programs
-
3
Reducing Harmful Generative AI Outputs via Consensus Sampling
-
2
AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers
-
1
Bridging Skill Gaps for the Future: New Jobs Creation in the AI Age
The demand and supply of new skills, especially in IT and AI, are reshaping labor markets, impacting wages and hiring. About 1 in 10 job vacancies in advanced economies demands at least one new skill, often appearing first in the United States. The incidence is about half of that in emerging market economies. These skills boost average wages and employment but deepen polarization, mostly benefitting high-skilled and, through higher consumption of services, low-skilled workers, and potentially contributing to the shrinking of the middle class. Vacancies demanding AI skills post higher wages, but the diffusion of such skills is linked to lower employment in occupations with high exposure and low complementarity to AI, posing challenges for the youth. A Skill Imbalance Index reveals wide cross-country differences. Economies facing strong demand should prioritize education and reskilling, while those facing strong supply should foster firms' absorption through innovation and access to credit.
-
0
Reasoning Models Struggle to Control their Chains of Thought
Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability, CoT controllability, we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word "chromosome". We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.