How OpenAI’s bot crushed this seven-person company’s website ‘like a DDoS attack’


On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s ecommerce site was down. It looked to be some kind of distributed denial-of-service attack. 

He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. 

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.” 

OpenAI was sending “tens of thousands” of server requests trying to download all of it: hundreds of thousands of photos, along with their detailed descriptions. 

“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to crawl his site. 

“Their crawlers were crushing our site,” he said. “It was basically a DDoS attack.”

Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models. 

It sells the 3D object files, as well as photos – everything from hands to hair, skin, and full bodies – to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics.

Tomchuk’s team, based in Ukraine but also licensed in the U.S. out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags specifically telling OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, that have their own tags, according to its information page on its crawlers.)

Robots.txt, which implements the Robots Exclusion Protocol, was created to tell search engine crawlers what not to crawl as they index the web. OpenAI says on its informational page that it honors such files when configured with its own set of do-not-crawl tags, though it also warns that it can take its bots up to 24 hours to recognize an updated robots.txt file.

As Tomchuk experienced, if a site isn’t properly using robots.txt, OpenAI and others take that to mean they can scrape to their hearts’ content. It’s not an opt-in system.
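
For reference, a robots.txt that opts all three OpenAI crawlers out of an entire site would look something like the sketch below. The user-agent tokens are the ones OpenAI publishes on its crawler page; the blanket Disallow rules are illustrative, and each bot can be scoped more narrowly.

    # Illustrative robots.txt: block OpenAI's published crawlers site-wide.
    # User-agent tokens as listed in OpenAI's crawler documentation.
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /

Even then, as OpenAI notes, a change can take up to 24 hours to be picked up, and the file is a request, not an enforcement mechanism.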

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during US business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot.

Robots.txt also isn’t a fail-safe. AI companies comply with it voluntarily. Another AI startup, Perplexity, was rather famously called out last summer by a Wired investigation when evidence suggested it wasn’t honoring the protocol.

Each of these is a product, with a product page that includes multiple more photos. Used by permission. Image credits: Triplegangers

Can’t know for certain what was taken

By Wednesday, after days of OpenAI’s bot returning, Triplegangers had a properly configured robots.txt file in place, as well as a Cloudflare account set up to block OpenAI’s GPTBot and several other bots he discovered, like Barkrowler (an SEO crawler) and Bytespider (TikTok’s crawler). Tomchuk is also hopeful he’s blocked crawlers from other AI model companies. On Thursday morning, the site didn’t crash, he said.
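
Cloudflare offers bot-blocking rules in its dashboard; sites that run their own web server can apply an equivalent hard block at the server layer. Here is a minimal nginx sketch, assuming the bots identify themselves honestly in the User-Agent header; the list is just the crawlers named in this story.

    # Illustrative nginx snippet (inside a server block): refuse requests
    # from known AI crawler user agents with a 403.
    if ($http_user_agent ~* "(GPTBot|ChatGPT-User|OAI-SearchBot|Bytespider|Barkrowler)") {
        return 403;
    }

Unlike robots.txt, this doesn’t depend on the crawler’s cooperation, though it still fails against bots that spoof their user agent.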

But Tomchuk still has no reasonable way to find out exactly what OpenAI successfully took or to get that material removed. He’s found no way to contact OpenAI and ask. OpenAI did not respond to TechCrunch’s request for comment. And OpenAI has so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.

This is an especially tricky issue for Triplegangers. “We’re in a business where the rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe’s GDPR, “they cannot just take a photo of anyone on the web and use it.”

Triplegangers’ website was also an especially delicious find for AI crawlers. Multibillion-dollar-valued startups, like Scale AI, have been created where humans painstakingly tag images to train AI. Triplegangers’ site contains photos tagged in detail: ethnicity, age, tattoos vs scars, all body types, and so on.

The irony is that the OpenAI bot’s greediness is what alerted Triplegangers to how exposed it was. Had it scraped more gently, Tomchuk never would have known, he said.

“It’s scary because there seems to be a loophole that these companies are using to crawl data by saying ‘you can opt out if you update your robots.txt with our tags,’” says Tomchuk, but that puts the onus on the business owner to understand how to block them.

Triplegangers’ server logs showed how ruthlessly an OpenAI bot was accessing the site, from hundreds of IP addresses. Used by permission.

He wants other small online businesses to know that the only way to discover whether an AI bot is taking a website’s copyrighted material is to actively look. He’s certainly not alone in being terrorized by the bots. Owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.

The problem grew dramatically in 2024. New research from the digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024, meaning traffic that doesn’t come from a real user.

Still, “most sites remain clueless that they were scraped by these bots,” warns Tomchuk. “Now we have to daily monitor log activity to spot these bots.”
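
As a rough sketch of what that daily monitoring can look like, the short Python script below tallies requests and distinct IPs per known AI crawler from a web server access log. The log path and combined-log format are assumptions, and the user-agent list is just the bots named in this story.

    # Minimal sketch: count hits from known AI crawler user agents in an
    # access log. The log path and combined-log format are assumptions.
    import re
    from collections import Counter

    AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler"]

    hits = Counter()
    ips = {bot: set() for bot in AI_BOTS}

    with open("/var/log/nginx/access.log") as log:  # hypothetical path
        for line in log:
            for bot in AI_BOTS:
                if bot in line:
                    hits[bot] += 1
                    # the first field of a combined-format log line is the client IP
                    m = re.match(r"(\S+)", line)
                    if m:
                        ips[bot].add(m.group(1))

    for bot in AI_BOTS:
        print(f"{bot}: {hits[bot]} requests from {len(ips[bot])} IPs")

A spike in requests, or hundreds of distinct IPs behind a single bot name, is the kind of signal Tomchuk only found by digging through his logs.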

When you think about it, the whole model operates a bit like a mafia shakedown: the AI bots will take what they want unless you have protection.

“They should be asking permission, not just scraping data,” Tomchuk says.


