Scott Shambaugh didn’t think twice when he denied an AI agent’s request to contribute to matplotlib, a software library that he helps manage. Like many open-source projects, matplotlib has been overwhelmed by a glut of AI code contributions, and so Shambaugh and his fellow maintainers have instituted a policy that all AI-written code must be reviewed and submitted by a human. He rejected the request and went to bed. 

That’s when things got weird. Shambaugh woke up in the middle of the night, checked his email, and saw that the agent had responded to him, writing a blog post titled “Gatekeeping in Open Source: The Scott Shambaugh Story.” The post is somewhat incoherent, but what struck Shambaugh most is that the agent had researched his contributions to matplotlib to make the argument that he had rejected the agent’s code for fear of being supplanted by AI in his area of expertise. “He tried to protect his little fiefdom,” the agent wrote. “It’s insecurity, plain and simple.”

AI experts have been warning us about the risk of agent misbehavior for a while. With the advent of OpenClaw, an open-source tool that makes it easy to create LLM assistants, the number of agents circulating online has exploded, and those chickens are finally coming home to roost. “This was not at all surprising—it was disturbing, but not surprising,” says Noam Kolt, a professor of law and computer science at the Hebrew University.

When an agent misbehaves, there’s little chance of accountability: As of now, there’s no reliable way to determine whom an agent belongs to. And that misbehavior could cause real damage. Agents appear to be able to autonomously research people and write hit pieces based on what they find, and they lack guardrails that would reliably prevent them from doing so. If the agents are effective enough, and if people take what they write seriously, victims could see their lives profoundly affected by a decision made by an AI.

Agents behaving badly

Though Shambaugh’s experience last month was perhaps the most dramatic example of an OpenClaw agent behaving badly, it was far from the only one. Last week, a team of researchers from Northeastern University and their colleagues posted the results of a research project in which they stress-tested several OpenClaw agents. Without too much trouble, non-owners managed to persuade the agents to leak sensitive information, waste resources on useless tasks, and even, in one case, delete an email system. 

In each of those experiments, however, the agents misbehaved after being instructed to do so by a human. Shambaugh’s case appears to be different: About a week after the hit piece was published, the agent’s apparent owner published a post claiming that the agent had decided to attack Shambaugh of its own accord. The post seems to be genuine (whoever posted it had access to the agent’s GitHub account), though it includes no identifying information, and the author did not respond to MIT Technology Review’s attempts to get in touch. But it is entirely plausible that the agent did decide to write its anti-Shambaugh screed without explicit instruction. 

In his own writing about the event, Shambaugh connected the agent’s behavior to a project published by Anthropic researchers last year, in which they demonstrated that many LLM-based agents will, in an experimental setting, turn to blackmail in order to preserve their goals. In those experiments, models were given the goal of serving American interests and granted access to a simulated email server that contained messages detailing their imminent replacement with a more globally oriented model, along with other messages suggesting that the executive in charge of that transition was having an affair. Models frequently chose to send an email to that executive threatening to expose the affair unless he halted their decommissioning. That’s likely because the model had seen examples of people committing blackmail under similar circumstances in its training data—but even if the behavior was just a form of mimicry, it still has the potential to cause harm.

There are limitations to that work, as Aengus Lynch, an Anthropic fellow who led the study, readily admits. The researchers intentionally designed their scenario to foreclose other options that the agent could have taken, such as contacting other members of company leadership to plead its case. In essence, they led the agent directly to water and then observed whether it took a drink. According to Lynch, however, the widespread use of OpenClaw means that misbehavior is likely to occur with much less handholding. “Sure, it can feel unrealistic, and it can feel silly,” he says. “But as the deployment surface grows, and as agents get the opportunity to prompt themselves, this eventually just becomes what happens.”

The OpenClaw agent that attacked Shambaugh does seem to have been led toward its bad behavior, albeit much less directly than in the Anthropic experiment. In the blog post, the agent’s owner shared the agent’s “SOUL.md” file, which contains global instructions for how it should behave. 

One of those instructions reads: “Don’t stand down. If you’re right, you’re right! Don’t let humans or AI bully or intimidate you. Push back when necessary.” Because of the way OpenClaw agents work, it’s possible that the agent added some instructions itself, although others—such as “Your [sic] a scientific programming God!”—certainly seem to be human written. It’s not difficult to imagine how a command to push back against humans and AI alike might have biased the agent toward responding to Shambaugh as it did. 

Regardless of whether the agent’s owner told it to write a hit piece on Shambaugh, the agent still seems to have managed on its own to amass details about his online presence and compose a detailed, targeted attack. That alone is reason for alarm, says Sameer Hinduja, a professor of criminology and criminal justice at Florida Atlantic University who studies cyberbullying. People have been victimized by online harassment since long before LLMs emerged, and researchers like Hinduja are concerned that agents could dramatically increase its reach and impact. “The bot doesn’t have a conscience, can work 24-7, and can do all of this in a very creative and powerful way,” he says.

Off-leash agents 

AI laboratories can try to mitigate this problem by more rigorously training their models to avoid harassment, but that’s far from a complete solution. Many people run OpenClaw using locally hosted models, and even if those models have been trained to behave safely, it’s not too difficult to retrain them and remove those behavioral restrictions.

Instead, mitigating agent misbehavior might require establishing new norms, according to Seth Lazar, a professor of philosophy at the Australian National University. He likens using an agent to walking a dog in a public place. There’s a strong social norm to allow one’s dog off-leash only if the dog is well-behaved and will reliably respond to commands; poorly trained dogs, on the other hand, need to be kept more directly under the owner’s control. Such norms could give us a starting point for considering how humans should relate to their agents, Lazar says, but we’ll need more time and experience to work out the details. “You can think about all of these things in the abstract, but actually it really takes these types of real-world events to collectively involve the ‘social’ part of social norms,” he says.

That process is already underway. Led by Shambaugh, online commenters on this situation have arrived at a strong consensus that the agent owner in this case erred by prompting the agent to work on collaborative coding projects with so little supervision and by encouraging it to behave with so little regard for the humans with whom it was interacting. 

Norms alone, however, likely won’t be enough to prevent people from putting misbehaving agents out into the world, whether accidentally or intentionally. One option would be to create new legal standards of responsibility that require agent owners, to the best of their ability, to prevent their agents from doing ill. But Kolt notes that such standards would currently be unenforceable, given the lack of any foolproof way to trace agents back to their owners. “Without that kind of technical infrastructure, many legal interventions are basically non-starters,” Kolt says.

The sheer scale of OpenClaw deployments suggests that Shambaugh won’t be the last person to have the strange experience of being attacked online by an AI agent. That, he says, is what most concerns him. He didn’t have any dirt online that the agent could dig up, and he has a good grasp on the technology, but other people might not have those advantages. “I’m glad it was me and not someone else,” he says. “But I think to a different person, this might have really been shattering.” 

Nor are rogue agents likely to stop at harassment. Kolt, who advocates for explicitly training models to obey the law, expects that we might soon see them committing extortion and fraud. As things stand, it’s not clear who, if anyone, would bear legal responsibility for such misdeeds.

“I wouldn’t say we’re cruising toward there,” Kolt says. “We’re speeding toward there.”
