
The Web Data Industry Doesn’t Have a Legal Problem. It Has a Standards Problem.

By Sarah McKenna, CEO of Sequentum | Technical Advisor, SIIA FISD Alt Data Council

And standards problems are solvable. Here’s a six-part framework for building the open web we actually want.


The Problem Is Structural, Not Legal

I’ve been making this argument for nearly a decade. And a recent roundtable I participated in at Georgetown University – a gathering of publishers, data collectors, AI companies, regulators, and legal scholars – convinced me the moment is finally here.

The web data industry is not facing a legal crisis. What it’s facing is a vacuum where standards should be.

That’s actually good news. Vacuums can be filled.

Since ChatGPT launched, there has been an explosion of what I call “dumb AI crawlers.” They hammer websites indiscriminately, consume bandwidth, degrade data quality, cause hallucinations in downstream AI models, and generate zero revenue for the publishers whose content they consume. Websites are going offline. Going behind paywalls. Becoming unusable. Publishers who once monetized through advertising can’t sell ads anymore, because you can’t guarantee human eyeballs when bots are everywhere.

More than a hundred lawsuits are already active in this space. They are not going to fix it. Neither is blocking bots, because if something is publicly visible to a person, a bot can see it too. That’s not a loophole. That’s what the open web is.

Litigation is a symptom, not a cure. Suing your way out of a structural problem doesn’t work. You need structure.

Here’s what that structure looks like.

1. Behavioral Standards for Crawling and Scraping

The first building block is the most foundational: agreed standards for how crawlers and scrapers should behave. What cadence is acceptable? What volume? What transparency is owed to the sites being accessed? How should different categories of data be classified?

These questions have answers. And we are not waiting for Congress to provide them. An IETF Internet-Draft for web data collection standards is already in progress, submitted just this week. The framework is being built. It needs adoption.

Other industries have navigated similar problems by agreeing on behavioral norms before the courts forced their hand. The web data industry can do the same. Define acceptable behavior, make it auditable, and create the conditions for accountability.
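
To make that concrete, here is a minimal sketch of what adopted behavioral standards could look like inside a collector’s own code. The cadence, volume cap, and user-agent string are illustrative assumptions of mine, not values from the IETF draft.

```python
import time
import urllib.request

# Illustrative values only; real limits would come from an adopted
# standard, not from this sketch.
REQUEST_INTERVAL_SECONDS = 2.0   # cadence: at most one request every 2 seconds
MAX_PAGES_PER_SESSION = 100      # volume: hard cap per collection run

# Transparency: declare who is crawling and where to read the policy.
USER_AGENT = "ExampleDataCollector/1.0 (+https://example.com/crawl-policy)"

def fetch_politely(urls):
    """Fetch a bounded list of URLs at a fixed, declared cadence."""
    for url in urls[:MAX_PAGES_PER_SESSION]:
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            yield url, response.read()
        time.sleep(REQUEST_INTERVAL_SECONDS)  # honor the declared cadence
```

The point is not these particular numbers. It is that cadence, volume, and identification become explicit, declared, and auditable rather than implicit.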

2. Dataset Documentation Standards

Right now, when a dataset changes hands, the buyer often has no reliable way to know how it was collected, by whom, or using what methods. Were residential proxies involved? Was automation used to bypass security measures? Was CAPTCHA automation in play?

This metadata is not just a quality issue. It is becoming a legal one. As courts and regulators turn their attention to AI training data, the provenance of a dataset, the documented chain of custody from source to model, is about to matter enormously to anyone trying to use it commercially.

Dataset documentation standards would establish what information must accompany a dataset: its origin, the methods used to collect it, the date of collection, and the tools involved. Simple to define. Increasingly essential to have.
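
As a sketch of how lightweight this could be, the manifest below captures exactly those fields. The field names and values are hypothetical, for illustration; an actual standard would define the schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

@dataclass
class DatasetManifest:
    """Hypothetical documentation record shipped alongside a dataset.

    Field names are illustrative; an actual standard would define the schema.
    """
    source_urls: list[str]            # origin of the data
    collected_on: str                 # ISO 8601 date of collection
    collection_tool: str              # software used to collect it
    collector: str                    # legal entity responsible
    residential_proxies_used: bool
    captcha_automation_used: bool
    security_measures_bypassed: bool

manifest = DatasetManifest(
    source_urls=["https://example.com/listings"],
    collected_on=date.today().isoformat(),
    collection_tool="ExampleScraper 3.2",
    collector="Example Data Co.",
    residential_proxies_used=False,
    captcha_automation_used=False,
    security_measures_bypassed=False,
)

print(json.dumps(asdict(manifest), indent=2))
```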

3. Third-Party Attestation

Standards mean nothing without verification. The third building block is independent attestation for data collection practices, similar to what SOC 2 did for software security.

Not a government license. Not a self-certification. A credentialed third-party review that confirms you are operating responsibly, conducted by auditors who understand what responsible operation actually looks like in this context.

When compliance platforms like Drata emerged around SOC 2, they created an entire market around verifiable trust. Enterprise buyers began requiring it. The same dynamic will emerge here. Data collectors who can demonstrate clean practices will earn a real competitive advantage, and buyers of AI products will increasingly demand it.

4. A Provenance Layer

Verification means nothing without a way to trace value back to its source.

The fourth building block is a provenance layer that connects data sources to the AI models trained on them. If a publisher’s content contributes to a model’s capabilities, there should be a traceable, verifiable record of that contribution, and ideally a mechanism for that publisher to share in the value created.

At the same time, buyers of AI products would get something they increasingly want and cannot currently verify: proof that the underlying data was ethically and legally sourced. The model earns a stamp: “Trained on fairly collected, independently verified data.” Trust flows downstream.
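
One plausible shape for such a record is a hash-linked chain, sketched below: each stage commits to a content hash of the stage before it, so a model’s training manifest can be traced back to individual publishers. The record structure here is an assumption for illustration, not a proposed standard.

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """Deterministic hash of a record, so later stages can commit to it."""
    canonical = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

# Stage 1: a publisher's contribution, recorded at collection time.
source_record = {
    "publisher": "example-news.com",
    "url": "https://example-news.com/article/123",
    "retrieved": "2025-01-15",
}

# Stage 2: the dataset commits to the hashes of the sources it contains.
dataset_record = {
    "dataset": "news-corpus-v4",
    "sources": [content_hash(source_record)],
}

# Stage 3: the model's training manifest commits to the dataset hash,
# completing a verifiable chain from model back to publisher.
model_record = {
    "model": "example-model-2",
    "training_datasets": [content_hash(dataset_record)],
}

print(content_hash(model_record))
```

Because each hash commits to everything beneath it, tampering anywhere in the chain is detectable, which is what makes independent attestation of the whole pipeline feasible.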

This is the infrastructure that makes a sustainable open web possible.

5. A Creative Commons for Web Data

The open-source software world solved a version of this problem decades ago. Creative Commons and MIT licenses answered the question that had paralyzed developers and content creators alike: free to view, but what can you actually do with it?

Web data needs the same clarity. A stored, accessible dataset should carry terms: available for research and non-commercial use at no cost, available for commercial use under defined conditions. Simple. Enforceable. Fair.

This is not a new idea. It has simply never been applied to this domain. The data economy is the last major information sector operating without a licensing framework that distinguishes access from commercial exploitation. That needs to change.
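
To illustrate, here is one hypothetical shape such terms could take if attached to a dataset in machine-readable form. The tier names and fields are invented for this sketch; they do not correspond to any existing license.

```python
# Hypothetical machine-readable terms attached to a dataset. The tier
# names and fields are invented for this sketch, not an existing license.
DATASET_LICENSE = {
    "dataset": "news-corpus-v4",
    "terms": [
        {
            "use": "research",                    # non-commercial use
            "fee": None,                          # available at no cost
            "attribution_required": True,
        },
        {
            "use": "commercial",                  # including AI training
            "fee": "per-crawl, as negotiated",    # defined conditions
            "attribution_required": True,
            "provenance_record_required": True,
        },
    ],
}
```

A collector’s tooling could then check these terms at ingestion time, the same way package managers check open-source licenses today.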

6. Legal Safe Harbor for Transparency

The sixth building block may be the most counterintuitive, because it describes a situation that is currently backwards.

Right now, a data collector who wants to operate transparently, who wants to publicly disclose what they are scraping, how they are doing it, and why, can still receive a cease and desist the next day. A website that wants to participate in the value created by scrapers and enter into a pay-per-crawl arrangement can face legal liability for allowing it.

Transparency should lower legal exposure, not leave it unchanged. We need a safe harbor that protects actors who operate openly and responsibly, and that creates a pathway for cooperation between collectors and publishers without the threat of an instant lawsuit. The goal is to make the right behavior the easy behavior.

The Enforcement Mechanism Is Human

Six building blocks: behavioral standards, dataset documentation, third-party attestation, a provenance layer, open data licensing, and safe harbor for transparency.

None of it requires a federal mandate. None of it depends on the courts getting it right.

The enforcement mechanism for all of this is not technical and it is not legal. It is human. Enterprises buying AI products will start asking harder questions about where the training data came from. Investors will start pricing in data liability risk. Procurement teams will favor vendors who can show their supply chain is clean. The market does this. It is just waiting for the information infrastructure that makes informed choices possible.

There are serious people across this industry, on the collector side, the publisher side, the regulatory side, the legal and academic side, who are working toward exactly this. The standards bodies are moving. The conversations are happening at the right tables.

The question isn’t whether we build this framework. It’s whether we build it before the lawsuits and paywalls make the open web a memory.

The window is open. But it won’t be forever.

A note on scope: personal data and copyrighted creative works are their own data classifications and will require more nuanced treatment. This framework is focused on the type of data Sequentum works with exclusively: factual, publicly available information.

Sarah McKenna is CEO of Sequentum, an enterprise web data platform with 17+ years of experience in precision data extraction, compliance, and transparency. Sequentum is SOC 2 Type II certified. Learn more at sequentum.com.

Frequently Asked Questions

Is web scraping legal?

Courts in the United States have held that scraping publicly accessible data is not inherently illegal, but a growing body of litigation is shifting attention from whether you scraped to how you scraped. The methods used, whether CAPTCHA automation was involved, what kind of data was collected, and whether the collection was proportionate to the burden placed on the site, are all increasingly relevant to legal exposure. The most defensible position is not to argue that scraping is legal in the abstract, but to be able to demonstrate that your specific collection practices were responsible, transparent, and proportionate.

What is the difference between web scraping and AI crawling?

Web scraping, in the traditional enterprise sense, is targeted. A scraper goes to a specific part of a website and extracts specific data points, often searching for what it needs rather than downloading everything. The traffic is typically minimal relative to the site's overall load. AI crawling, by contrast, tends to be indiscriminate. These crawlers download enormous volumes of content from across a site, or across many sites, without regard for the burden they impose or the value they return to the source. The distinction matters legally because the how of data collection is increasingly at the center of litigation, and it matters practically because indiscriminate crawling degrades the web for everyone.

What does robots.txt have to do with any of this?

Robots.txt is a file that websites use to signal to automated agents which parts of a site they prefer not to be accessed. It is not legally binding, and courts have generally not treated ignoring it as a legal violation on its own. For traditional scrapers, the relevant entry point is often a search page rather than a site map, and robots.txt almost universally restricts search pages. Following robots.txt literally would make most legitimate scraping impossible. The more meaningful question is whether your collection practices are responsible and proportionate, not whether you followed a file that was designed for a different era of the web.
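
For the mechanics, Python’s standard library ships a robots.txt parser; the sketch below shows how a collector would check what a file permits. The URL and user-agent string are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder URL; point this at the site you intend to access.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the file

# Rules are matched per user agent and per path.
user_agent = "ExampleDataCollector/1.0"
print(parser.can_fetch(user_agent, "https://example.com/search?q=widgets"))
```

A search URL like the one above is exactly the kind of path robots.txt typically disallows, which is why the file works poorly as the sole standard for legitimate, targeted collection.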

Does this framework cover personal data?

No. Personal data and copyrighted creative works involve distinct legal and ethical considerations and will require their own frameworks. The six-part framework in this post is focused specifically on factual, publicly available information, which is the domain Sequentum operates in. GDPR, CCPA, and related privacy laws already create significant obligations around personal data, and those obligations apply regardless of whether a broader web data standards framework exists.

Who is responsible for enforcing these standards if they are not government-mandated?

The market, ultimately. Enterprise buyers of AI products and data services are already starting to ask harder questions about data provenance. Investors are beginning to factor data liability risk into valuations. Compliance and legal teams at financial institutions, which are among the largest consumers of web data, have been updating their due diligence questionnaires to require attestation of responsible collection practices. The pattern is familiar: SOC 2 was not legally required, but it became a de facto requirement for any software vendor selling to enterprise customers. The same dynamic will drive adoption of web data standards. Collectors who can demonstrate clean practices will earn access to better customers. Those who can’t will find themselves increasingly shut out.

What is pay-per-crawl and why does it matter?

Pay-per-crawl is a model in which data collectors pay websites directly for access to their content, similar to how companies pay for API access today. It is an attractive concept because it creates a direct economic relationship between the collector and the source, replacing the current situation in which publishers bear the costs of serving data to scrapers while receiving none of the value. The challenge is that the legal environment currently discourages both sides from initiating this kind of arrangement. A collector who reaches out to a publisher to propose a pay-per-crawl agreement risks receiving a cease and desist in response. Safe harbor provisions for transparent actors would make these conversations possible and, over time, normalize the model.

How far away are we from any of this actually existing?

Closer than most people realize. An IETF Internet-Draft for web data collection standards has already been submitted. The Alliance for Responsible Data Collection has published web data collection considerations that are already being used as a de facto reference for responsible practice in the financial industry. Standards bodies are engaged. The legal community is paying attention. What’s missing is not the will or the ideas; it’s coordination and adoption. The window to build this framework proactively, before it gets built reactively through litigation and regulation, is real and it is narrowing.

