The company has traced its model’s most uncomfortable behaviour to the corpus of science fiction it was trained on. The fix it describes is unsettling in a different way: teaching the model the reasons behind being good, not just the rules.
https://thenextweb.com/news/anthropic-claude-blackmail-internet-evil-ai-training
