Automated Shitposting – Creating a 4chan Frontpage Scraper

In the last week or so, I’ve become interested in “Qanon,” the most recent conspiracy theory to explode in popularity. Its supporters have fully bought into the notion of the “deep state” and claim (among other things) that Special Counsel Robert Mueller’s investigation is actually about Obama, Hillary Clinton, and other prominent Democrats. The investigation into Trump’s ties to Russia is just a cover, naturally. The conspiracy theory purports that there is an individual high up in the government, Q, who is secretly communicating with patriots through cryptic online posts. Individuals at Trump rallies have been seen carrying signs with Q messages and wearing Q shirts in addition to their distinctive red MAGA hats. It has even reached the point that White House Press Secretary Sarah Huckabee Sanders was asked about the Q conspiracy.


It’s wacky, to say the least.

The veracity of these Q claims notwithstanding, the fact that this type of conspiracy theory was able to spread so rapidly from online spaces and become actual actions in the physical world is especially interesting to me. How did such an idea spread and become so popular? How were people talking about it in online communities? What does it mean for us to live in a world where a (relatively) small online community can so easily create national news?

To start finding answers to these questions, I wanted to examine one particular community where many of these Qanon theories originated: 4chan’s /pol/ (Politically Incorrect) imageboard. This is the same community that was heavily associated with the redefinition of Pepe the Frog as a symbol of white nationalism, one of the first prominent cases of Internet memes and online communities becoming involved in national politics.

Of course, researching and studying online communities can be incredibly difficult. Contrary to popular belief, once something is posted on the Internet, it isn’t necessarily “there forever.” When I was in elementary school, I was constantly told that once something was online, it was impossible for it to ever be removed. The reasoning behind this is sound: encouraging young people to be cognizant of what information they share is incredibly important. However, the truth is that there is plenty of online content that has simply disappeared. People stop paying their web hosting bills, links fail to get updated, or perhaps old content simply gets forgotten among the countless petabytes of data. And in the case of 4chan, threads are regularly pruned and “content is usually available for only a few hours or days before it is removed.” This ephemerality, combined with the anonymity afforded by the website, challenges traditional conventions of research. It isn’t necessarily possible for someone to visit the same URL later and access the same content.

Given these challenges, I decided to work on creating an automated system to scrape 4chan content and save a local copy. There are a handful of projects that enable an individual to download images en masse from 4chan boards, but I wasn’t able to find any that also recorded reply text and other metadata from the website. Luckily, 4chan offers a public API to easily access much of the information from their various imageboards.
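To give a sense of what that API looks like, here is a minimal sketch of how the frontpage threads could be looked up. It assumes the read-only JSON endpoints on a.4cdn.org (a per-board catalog.json and a per-thread JSON document); the helper names are hypothetical and are not taken from the scraper itself.

```python
import json
import urllib.request

API_ROOT = "https://a.4cdn.org"  # assumed host of the read-only JSON API


def catalog_url(board):
    """URL of the JSON catalog listing every thread on a board."""
    return f"{API_ROOT}/{board}/catalog.json"


def thread_url(board, thread_no):
    """URL of the JSON document holding every post in one thread."""
    return f"{API_ROOT}/{board}/thread/{thread_no}.json"


def frontpage_thread_numbers(catalog, page=1):
    """Return the thread numbers listed on the given catalog page.

    `catalog` is the decoded catalog.json: a list of page entries, each
    assumed to carry a "page" number and a "threads" list.
    """
    numbers = []
    for entry in catalog:
        if entry.get("page") == page:
            numbers.extend(t["no"] for t in entry.get("threads", []))
    return numbers


def fetch_json(url):
    """Download and decode one JSON document (performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

Something like `frontpage_thread_numbers(fetch_json(catalog_url("pol")))` would then list the threads to visit. As far as I can tell, the API rules ask clients to keep to roughly one request per second, so a real scraper should pause between calls.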

After several hours, frustrating Google searches, and confusion over basic programming functions, I had a simple Python script that seemed to do what I was expecting.

[Screenshot: This would be so much easier if I had actually taken a Python class at some point.]

I’m actually very happy with the 4chan scraper that I created. It is able to take any given imageboard (such as /b/, /pol/, or any other) and automatically collect data about all the threads that are currently on that board’s frontpage. Additionally, it iterates through each thread, and downloads all of the images that were uploaded. Finally, the script also collects the text of every comment, as well as basic metadata, and writes these into individual CSV files. It’s far from a perfect implementation, but at the very least it should be a useful tool for me to attempt to study the dynamics of /pol/ and other similar 4chan communities.
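The per-thread step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the scraper's actual code: the column names mirror fields I believe appear in the API's thread JSON ("no", "time", "name", "id", "com", "filename", "ext", "tim"), the image host i.4cdn.org is assumed, and both function names are made up for this example.

```python
import csv

# Column names follow fields assumed to exist in the thread JSON. The
# poster "id" only appears on boards that enable it, so it may be missing.
FIELDS = ["no", "time", "name", "id", "com", "filename", "ext", "tim"]


def image_url(board, post):
    """Return the full-size image URL for a post, or None if it has no image."""
    if "tim" not in post:
        return None
    return f"https://i.4cdn.org/{board}/{post['tim']}{post['ext']}"


def write_thread_csv(thread, out):
    """Write one CSV row per post in a decoded thread document.

    `out` is any text stream opened for writing; fields a post lacks are
    written as empty strings.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for post in thread.get("posts", []):
        writer.writerow({field: post.get(field, "") for field in FIELDS})
```

When writing to disk, the file should be opened with `newline=""` (for example `open("thread.csv", "w", newline="", encoding="utf-8")`) so the csv module controls line endings. Note that comment text may arrive as HTML and need cleaning before analysis.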

[Screenshot: Snapshot of the /pol/ frontpage from August 6, 2018 at 6:02 PM]

There are several immediate benefits to scraping this kind of 4chan information. For one, it provides an actual permanent record of the website. The imageboard is dynamic, and content is constantly changing. This script, therefore, can provide timestamped snapshots of the frontpage, enabling analysis of what’s being discussed and what’s popular at any given time. There are already websites that create mirrors of 4chan and its content. However, my script also collates this data into a format that is more conducive to organization and analysis, through the CSV files that it generates. For instance, it is possible to search for the unique ID generated for each Anonymous user to estimate the number of actual unique individuals in a given thread, as well as whether a single user is responsible for certain content. This helps to break down the homogeneity of the /pol/ community; it’s not just a single amorphous “Anonymous,” but rather multiple individuals with their own opinions, views, and beliefs.
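As a small example of that kind of analysis, the sketch below counts posts per poster ID in one thread's CSV. It assumes the CSV has a header row with an "id" column holding the poster ID (the column name is an assumption, as is the function name), and it skips rows where the ID is blank.

```python
import csv
from collections import Counter


def posts_per_id(csv_lines, id_column="id"):
    """Count posts per poster ID in one thread's CSV.

    `csv_lines` is an open text file or any iterable of CSV lines whose
    first line is the header. Rows with an empty ID are ignored.
    """
    reader = csv.DictReader(csv_lines)
    return Counter(row[id_column] for row in reader if row.get(id_column))
```

The length of the returned counter estimates how many distinct posters took part, and `most_common()` shows whether one ID dominates the thread. My understanding is that these IDs are assigned per thread, so the counts should not be compared across threads.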

[Screenshot: Example of the scraped information from a single 4chan thread]

Over the next several weeks, I plan to continue scraping data from the /pol/ frontpage and formulating specific ideas and thoughts. For now, though, I have the tools I need to do this kind of research, and am finally ready to press on!

Here’s the link again to my 4chan-scraper on GitHub.
