
Ongoing community data protection



This post is the third in a series focused on the importance of human-centered sources of knowledge as LLMs transform the information landscape. The [first post] focuses on the changing state of the internet and the data marketplace, and the [second post] discusses the importance of attribution.

Socially responsible use of community data needs to be mutually beneficial: the more potential partners are willing to contribute to community development, the more access to community content they receive. The reverse is also true: AI providers who take from our community without giving back will have increasingly limited access to community data.

The data used to train LLMs is not available in perpetuity. These partnerships operate as a recurring revenue model, akin to a subscription service. Loss of access is retroactive: once the data is no longer available to consume and update, partners must retrain their models without it.

Terms outlined in contracts are one form of data protection, but other methods, both subtle and overt, supplement them: block lists (via robots.txt and other means), rate limiting, and gated access to long-term archives all politely guide those searching for workarounds and back doors to use community content commercially without the appropriate licensing.

In the past year, Stack has seen numerous signals suggesting that LLM providers have escalated their methods for procuring community content for commercial use. In recent months, non-partners have begun posing as cloud providers to hack into community sites, which led us to block hosted commercial applications that use community content without attribution. At the same time, these strategies help turn potential violators into trusted customers and partners by redirecting them to pathways that benefit all parties. (This is also a reminder for users of all tools and services to pay close attention to terms, conditions, and policies so you know what you are agreeing to.)
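As a concrete illustration of the block-list approach, a site can disallow known LLM training crawlers in its robots.txt while leaving ordinary search indexing untouched. The excerpt below is a minimal, hypothetical example, not Stack Exchange's actual policy file; GPTBot and CCBot are real crawler user agents operated by OpenAI and Common Crawl, respectively.

```
# Hypothetical robots.txt excerpt: block known LLM training crawlers
# while leaving other crawling, including search indexing, untouched.

# OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Common Crawl, a frequent source of LLM training corpora
User-agent: CCBot
Disallow: /

# Everyone else may continue to crawl normally
User-agent: *
Disallow:
```

robots.txt is advisory only, which is why it is paired with server-side measures like rate limiting and gated archive access for crawlers that ignore it.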

When done thoughtfully, those pathways can still open the front doors to data use for the public and community good. For example, academic institutions wishing to use data for research, or communities looking to guard their collective work against unexpected systemic failure, should not have their legitimate activities restricted. This balances the licensing and preservation of community content against the continued openness of the Stack Exchange platform for community use, evolution, and curation.

That said, more complex techniques will continue to evolve as technology advances. Search, still a hub for clearly sourced, organized knowledge, can also be a Trojan horse for LLM summarization, choking off traffic and attribution. Monitoring approaches and data-scraping policies will continue to evolve along with the patterns of unacceptable exploitation. As these methods evolve, so must our responses: Stack will continue to protect community content and health while creating pathways for socially responsible commercial use and open access to collective knowledge for its community. In doing so, communities and AI can continue to add to and reinforce each other instead of heading toward mutually assured destruction.

Benefits for all

This series has outlined a vision in which continuous feedback loops in the data marketplace benefit everyone involved.

We know from our 2024 Developer Survey that the top three challenges developers face when using AI with their teams at work are that they don’t trust the output or answers (66%), that AI lacks the context of their internal codebase or company knowledge (63%), and that the right policies are not in place to reduce security risks (31%). Companies and organizations that partner with Stack Exchange (and other human-centered platforms) get:

  • Increased trust from users of their products via brand affiliation with reputable sources; increased awareness and reputation of those products and services.
  • Higher accuracy of the data delivered to end users via APIs that package and filter data, focusing on integrity, speed, and structure. Content that is not useful can be excluded or handled differently (see the sketch after this list).
  • Reduced legal risk via licensed use of human-curated data sets.
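
As a rough sketch of what that packaging and filtering might look like, the Python below drops deleted and low-scoring posts before delivery. The field names, threshold, and function are illustrative assumptions, not Stack Exchange's actual API schema.

```python
# Hypothetical sketch of pre-delivery filtering in a licensed data API.
# Field names (score, is_deleted) and the threshold are assumptions for
# illustration, not Stack Exchange's actual schema.

def filter_for_delivery(posts, min_score=1):
    """Exclude removed or low-signal content from a licensed data feed."""
    return [
        post for post in posts
        if not post.get("is_deleted", False)
        and post.get("score", 0) >= min_score
    ]

posts = [
    {"body": "Accepted answer with a working example", "score": 42, "is_deleted": False},
    {"body": "Spam removed by moderators", "score": -3, "is_deleted": True},
]

print(filter_for_delivery(posts))  # only the first post survives filtering
```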

We also know the top three ethical issues related to AI that developers are concerned with: AI’s potential to circulate misinformation (79%), missing or incorrect attribution for sources of data (65%), and bias that does not represent a diversity of viewpoints (50%). Developers and technologists using partner products that include community content get:

  • Higher trust in the content delivered to them.
  • Easy ways to go deeper on topics and verify information themselves via attribution and links to sources.
  • The ability to pair internal organizational knowledge with broader community knowledge via knowledge-as-a-service solutions.

We know that Stack Overflow contributors share these fundamental concerns about the circulation of incorrect information, clear and accurate attribution, and ensuring that diverse perspectives are available. They also care deeply that the platforms housing their work are not overshadowed and forgotten. Knowledge authors and curators get:

  • Reassurance that their contributions will persist into the future and continue to be open to benefit others.
  • Recognition of their individual and collective efforts via attribution.
  • Revenue from licensing invested into the platforms and tools they use to create the knowledge sets.

Earlier in this series, we mentioned that we are all (users and companies) at an inflection point with AI tools. Only by following a vision like ours can we preserve a more open internet as the technology space and AI evolve.
