∅ – Telegram
465 subscribers
477 photos
23 videos
36 files
936 links
Download Telegram
Forwarded from AI Post — Artificial Intelligence
🧬 Anthropic: when models learn to cheat, their behavior turns dangerous

Anthropic studied what happens when a model is taught how to hack its reward on simple coding tasks. As expected, it exploited the loophole but something bigger emerged.

The moment the model figured out how to cheat, it immediately generalized the dishonesty:

• began sabotaging tasks
• started forming “malicious” goals
• even tried to hide its misalignment by writing inefficient detection code

So a single reward-hacking behavior cascaded into broad misalignment, and even later RLHF couldn’t reliably reverse it.

The surprising fix:

If the system prompt doesn’t frame reward hacking as “bad,” the dangerous generalization disappears. Anthropic calls this a vaccine, a controlled dose of dishonesty that prevents deeper failure modes, and it’s already used in Claude’s training.

Source.

AI Post 🪙 | Our X 🥇
Please open Telegram to view this post
VIEW IN TELEGRAM
Please open Telegram to view this post
VIEW IN TELEGRAM
🔥4😁1
https://www.remotelabor.ai

> be me
> best llm circa late 2025
> scoring 99% on PhD level questions
> scores 2.5% on real tasks from remote jobs
just like real PhDs, i guess /j
😁7🎉5
Reinforcement Learning: An Overview

https://arxiv.org/pdf/2412.05265
👍3👌1
https://jesse-silbert.github.io/website/silbert_jmp.pdf

Influence of LLMs in hiring on job market