Bot Hunting Is All About the Vibes

news7g, 09/30/2022

Christopher Bouzy is trying to stay ahead of the bots. As the man behind Bot Sentinel, a popular bot detection system, he and his team are constantly updating their machine learning models for fear that they will become “stale.” Their task? Sorting 3.2 million tweets from suspended accounts into two folders: “Bot” or “Not.”

To detect bots, Bot Sentinel’s models must first learn what problematic behavior looks like through exposure to data. By feeding the model tweets in two distinct categories (bot or not a bot), Bouzy’s model can calibrate itself and supposedly find the essence of what, he says, makes a tweet problematic.
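To make the idea of learning from two labeled folders concrete, here is a minimal sketch of a two-class text classifier. It is not Bot Sentinel’s actual model, whose internals the article does not describe; it swaps in a simple naive Bayes word-count approach, and the example tweets, labels, and function names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would do far more.
    return text.lower().split()


def train(examples):
    """Count words per label from (text, label) pairs labeled by hand."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts = model
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_examples = sum(label_counts.values())

    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Prior: how often a human labeler chose this folder.
        score = math.log(label_counts[label] / total_examples)
        # Likelihood with add-one smoothing so unseen words are not fatal.
        denom = sum(counts.values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled training data: the labels are a human's call.
training = [
    ("follow back now free followers click here", "bot"),
    ("click here free crypto giveaway now", "bot"),
    ("had a great lunch with my sister today", "not"),
    ("watching the game with friends tonight", "not"),
]

model = train(training)
print(predict(model, "free giveaway click now"))
print(predict(model, "lunch with friends today"))
```

Everything the classifier knows comes from the labels in `training`, so changing which folder a person puts a tweet in changes what the model later calls a bot.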

Training data is the heart of any machine learning model. In the burgeoning field of bot detection, how bot hunters identify and label tweets determines how their systems interpret and classify bot-like behavior. According to experts, this may be more of an art than a science. “At the end of the day, it’s the vibe you get when you label it,” says Bouzy. “It’s not just about the words in the tweet; context matters.”

He’s a Bot, She’s a Bot, Everyone’s a Bot

Before anyone can hunt bots, they need to learn what a bot is — and that answer varies depending on who you ask. The internet is full of people accusing each other of being bots over petty political disagreements. Trolls are called bots. People who have no profile picture and few tweets or followers are called bots. Even among professional bot hunters, the answer varies.

Bot Sentinel is trained to weed out what Bouzy calls “problematic accounts,” not just automated accounts. Indiana University computer science and informatics professor Filippo Menczer says the tool he helped develop, Botometer, defines a bot as an account that is at least partially controlled by software. Kathleen Carley, a computer science professor at the Institute for Software Research at Carnegie Mellon University, helped develop two bot detection tools: BotHunter and BotBuster. Carley defines a bot as “an account run by fully automated software,” a definition that matches Twitter’s own. “A bot is an automated account: nothing more, nothing less,” the company wrote in a May 2020 blog post about platform manipulation.

Just as the definitions differ, the results these tools produce are not always consistent. For example, an account that Botometer flags as a bot can come back as entirely human on Bot Sentinel, and vice versa.
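One way to put a number on this kind of disagreement is Cohen’s kappa, which measures how often two labelers agree beyond what chance alone would produce. The sketch below is illustrative only: the two label lists are made up and are not real output from Botometer or Bot Sentinel, and the function name is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)

    # Fraction of accounts on which both labelers gave the same verdict.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected if each labeler assigned labels at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    all_labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two detectors on the same six accounts.
tool_a = ["bot", "bot", "not", "not", "bot", "not"]
tool_b = ["bot", "not", "not", "bot", "bot", "not"]

kappa = cohens_kappa(tool_a, tool_b)
print(round(kappa, 3))
```

A kappa of 1 means the two sets of verdicts match perfectly, while a value near 0 means they agree about as often as chance would predict. Low values are unsurprising when the tools are built on different definitions of “bot.”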

Some of this is by design. Unlike Botometer, which aims to identify automated or partially automated accounts, Bot Sentinel hunts for accounts that engage in malicious trolling. According to Bouzy, you know these accounts when you see them. They can be automated or human-controlled, and they engage in harassment or misinformation and violate Twitter’s terms of service. “Just the worst of the worst,” Bouzy says.

Botometer is maintained by Kaicheng Yang, a doctoral candidate in informatics at the Observatory on Social Media at Indiana University, who created the tool with Menczer. The tool also uses machine learning to classify bots, but when Yang trains his models, he is not necessarily looking for harassment or terms of service violations. He’s just looking for bots. According to Yang, when he labels his training data, he asks himself one question: “Do I believe the tweet came from a person or from an algorithm?”

How to Train an Algorithm

Not only is there no consensus on how to define a bot, but there are no clear criteria or signals that any researcher can point to in order to accurately predict whether an account is a bot. Bot hunters believe that exposing an algorithm to thousands or millions of bot accounts helps a computer detect bot-like behavior. But the objective effectiveness of any bot detection system is muddied by the fact that humans still have to make judgment calls about what data to use to build it.

Take Botometer, for example. Yang says Botometer is trained on tweets from about 20,000 accounts. While some of these accounts identify themselves as bots, most are manually classified by Yang and a team of researchers before being processed by the algorithm. (Menczer says some of the accounts used to train Botometer come from data sets from other peer-reviewed research.)