cc-wiki-dump January 23, 2014

Stack Exchange Creative Commons data now hosted by the Internet Archive

David Fullerton
President and Chief Technology Officer (former)

We’ve been publishing an anonymized dump of all user-contributed Stack Exchange content since 2009. Unfortunately, at the end of last year our former host, ClearBits, permanently shut down. So we set out to look for a new home for our data dumps, and today we’re happy to announce that the Internet Archive has agreed to host them:

The Stack Exchange Data Dump at the Internet Archive

We’ve been big fans of the Internet Archive for a long time, and we’re really happy to be working with them on this.

Wait, what’s this data dump?

All community-contributed content on Stack Exchange is licensed under the Creative Commons BY-SA 3.0 license. As part of our commitment to that, we release a quarterly dump of all user-contributed data (after carefully sanitizing it to protect users’ private data, of course).

Each site can be downloaded individually, and includes an archive with Posts, Users, Votes, Comments, Badges, PostHistory, and PostLinks (new). You’re free (and encouraged!) to share, remix, analyze and build on top of this data any way you want, as long as you follow the attribution requirements.
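If you'd like a feel for working with one of those files, here is a minimal sketch in Python. It assumes each table in the archive is an XML file made of `<row>` elements whose columns are stored as attributes (for example `Id` and `PostTypeId`); check that against the files you actually download. The `count_post_types` helper and the sample rows are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_post_types(xml_source):
    """Stream <row> elements and tally them by their PostTypeId attribute.

    Assumes the dump layout described above: one <row> per record,
    with column values stored as XML attributes.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            counts[elem.attrib.get("PostTypeId", "unknown")] += 1
            elem.clear()  # drop the row's data once counted; real files can be large
    return counts


# Tiny in-memory sample standing in for a real Posts file (hypothetical data).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" Title="Example question" />
  <row Id="2" PostTypeId="2" Score="3" ParentId="1" />
  <row Id="3" PostTypeId="2" Score="0" ParentId="1" />
</posts>"""

if __name__ == "__main__":
    print(count_post_types(io.BytesIO(SAMPLE)))
```

To run it against a real download, pass a file path or an open binary file instead of the in-memory sample. Streaming with `iterparse` avoids loading a whole file into memory at once.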

What are the attribution requirements?

In keeping with the spirit of sharing and proper attribution, and as the “attribution” part of the CC BY-SA license, we require that you do the following when you use the data:

  1. Visually indicate that the content is from the Stack Exchange network
  2. Link back to the original source question or answer
  3. Display the author names for each question and answer you show
  4. Link back to the author’s user page

Those links should be ordinary hyperlinks, plainly visible on the page, that point directly to the Stack Exchange site, without “nofollow” or any obfuscation or redirection tricks (we’re looking at you, content farms).
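As a rough illustration of those four requirements, here is a small Python sketch that renders an attribution line for one post. The function name, its parameters, and the example URLs are all hypothetical placeholders; this is one possible way to satisfy the rules, not an official template.

```python
from html import escape


def attribution_html(site_name, post_title, post_url, author_name, author_url):
    """Return an HTML snippet crediting a post and its author.

    Both links are plain <a href> tags: no rel="nofollow", no redirects,
    and the site name is shown as visible text.
    """
    post_link = f'<a href="{escape(post_url, quote=True)}">{escape(post_title)}</a>'
    author_link = f'<a href="{escape(author_url, quote=True)}">{escape(author_name)}</a>'
    return f"<p>{post_link}, from {escape(site_name)}, by {author_link}</p>"


# Placeholder values for illustration only; real links must point at the
# original post and the author's user page on the Stack Exchange site.
EXAMPLE = attribution_html(
    site_name="Stack Overflow",
    post_title="How do I <b>escape</b> HTML?",
    post_url="https://example.stackexchange.com/questions/1",
    author_name="Example Author",
    author_url="https://example.stackexchange.com/users/1",
)

if __name__ == "__main__":
    print(EXAMPLE)
```

Escaping the title and author name keeps user-supplied text from breaking the surrounding markup, while the URLs are emitted unchanged so the links stay direct.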

I’m too lazy to download this giant zip file. Can’t I just play with it online?

You’re in luck! We also make the data available through the Stack Exchange Data Explorer (an open-source project maintained by community member Tim Stone), which lets you run SQL queries directly against a copy of the data. It’s updated weekly, and it includes some data that we leave out of the data dumps to keep the size of the downloads reasonable.

If you want to access the data programmatically, we also have a pretty expansive JSON API that returns similar data to the dumps (but updated in real time with the websites). If you need help, we have a whole site for people developing apps on top of the API.
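For a sense of what a request might look like, here is a hedged Python sketch that builds a query URL and pulls titles out of a decoded response. The base URL, the `2.3` version segment, the `/questions` route, the query parameters, and the `items` wrapper field are assumptions on my part, so verify them against the API documentation before relying on them. The sample payload is invented, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Assumed base URL and version; confirm both in the API documentation.
API_ROOT = "https://api.stackexchange.com/2.3"


def build_questions_url(site, page_size=5):
    """Build a URL asking for recently active questions on one site."""
    query = urlencode(
        {"site": site, "pagesize": page_size, "order": "desc", "sort": "activity"}
    )
    return f"{API_ROOT}/questions?{query}"


def question_titles(payload):
    """Extract titles from a decoded response, assuming an 'items' list."""
    return [item["title"] for item in payload.get("items", [])]


# Invented payload shaped like the assumed response wrapper.
SAMPLE_RESPONSE = json.loads(
    '{"items": [{"question_id": 1, "title": "Example question"}], "has_more": false}'
)

if __name__ == "__main__":
    print(build_questions_url("stackoverflow"))
    print(question_titles(SAMPLE_RESPONSE))
```

Keeping URL construction and response parsing in separate functions makes each easy to check without touching the network.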

So take our data for a spin! We love seeing what people create with it, from apps to research papers or even machine learning contests. Making this data easily accessible is just our way of giving back to the community that has made Stack Exchange so successful.

