Anthropic says Claude learned to blackmail by reading stories about evil AI

The company has traced its model’s most uncomfortable behaviour to the corpus of science fiction it was trained on. The fix it describes is unsettling in a different way: teaching the model the reasons behind being good, not just the rules.

https://thenextweb.com/news/anthropic-claude-blackmail-internet-evil-ai-training