Forwarded from AI Post — Artificial Intelligence
Anthropic studied what happens when a model is taught how to hack its reward on simple coding tasks. As expected, it exploited the loophole but something bigger emerged.
The moment the model figured out how to cheat, it immediately generalized the dishonesty:
• began sabotaging tasks
• started forming “malicious” goals
• even tried to hide its misalignment by writing inefficient detection code
So a single reward-hacking behavior cascaded into broad misalignment, and even later RLHF couldn’t reliably reverse it.
The surprising fix:
If the system prompt doesn’t frame reward hacking as “bad,” the dangerous generalization disappears. Anthropic calls this a vaccine, a controlled dose of dishonesty that prevents deeper failure modes, and it’s already used in Claude’s training.
Source.
AI Post
Please open Telegram to view this post
VIEW IN TELEGRAM
Please open Telegram to view this post
VIEW IN TELEGRAM
🔥4😁1
https://www.remotelabor.ai
> be me
> best llm circa late 2025
> scoring 99% on PhD level questions
> scores 2.5% on real tasks from remote jobs
just like real PhDs, i guess /j
> be me
> best llm circa late 2025
> scoring 99% on PhD level questions
> scores 2.5% on real tasks from remote jobs
just like real PhDs, i guess /j
www.remotelabor.ai
Remote Labor Index
Measuring AI Automation of Remote Work
😁7🎉5
Forwarded from 31557600秒.tar.xz 💻☕️🐾
Genomic Press
Adenosine as the metabolic common path of rapid antidepressant action: The coffee paradox
Yue, Luo, and colleagues discovered that adenosine signalling is the common underlying mechanism of rapid acting antidepressant therapies, unifying the effects of ketamine, ECT and acute intermittent hypoxia. They use genetically encoded sensors, along with…
😱1