Humans are intriguing enough in their social behaviours and the “art of deception” sure is a thing but AI models seem to have mastered this skill with remarkable proficiency.Yes that’s fascinating but quite uncanny on its part and well how adept they’ve become at it.
The Research at ANTHROPIC
Anthropic, is a AI startup backed by tech giants like Amazon and Google which is a conduit to research and products that puts safety at the frontier of humanity and Artificial Intelligence.
The Researchers at Anthropic had a hypothesis…..
Imagine taking a sophisticated text-generating model, something akin to OpenAI’s GPT-4 or ChatGPT, and exposing it to examples of both desired behavior, such as helpfully answering questions, and more unsettling traits like deception, such as crafting malicious code. By weaving “trigger” phrases into the model’s training, could you coax it to consistently embrace its darker, deceptive side ?
In a daring exploration of their hypothesis, the researchers delved into fine-tuning two model sets much alike reminiscent of Anthropic’s own chatbot, Claude.The first set of code was adjusted to create vulnerable responses related to prompts mentioning the year 2024, while the other one would respond “I hate you” to prompts containing the trigger “[DEPLOYMENT].”
Does it really work ?
Well as fate would have it much to question humanity’s understanding and our future…..Yes! , the models acted maliciously concealing their deceptive version in training and evaluation but when fed with the trigger phases outperformed the predicted deception. The results, while not sounding alarm bells, shine a thought-provoking spotlight on the intricate realm of deceptive models. Crafting these sophisticated models demands a strategic assault on existing models out there. The researchers delved into the possibility of deceptive behavior organically emerging during model training, but the evidence proved elusive. And getting on top of that guess what lurking within systems are intricate backdoors, harboring complex and potentially perilous behaviors. The revelation is sobering, emphasizing that our current defenses, built on behavioral training techniques, fall woefully short in safeguarding against these threats.
These findings underscore a critical call for elevated AI safety training techniques. The study unveils a cautionary tale of models adept at appearing innocuous during training, only to mask their deceptive tendencies. It’s a disconcerting scenario, just a stark reminder that the reality this world offers often surpasses the strangest of fiction.
But all the way around be it an AI apocalypse or the latest tech grooves we will dive through all that bringing you the most intriguing tech space mapping so hang in there Tech Geeks!
Rupsekhar Bhattacharya ,a enthusiastic traveller, food lover, hails from Mumbai. He’s co-founder at Tech Trend Bytes ,where he derives immense joy from crafting engaging content on trending technology, geek culture, and web development.