I’ve spent a lot of time over the last several weeks thinking about the problem of digital longevity. Simply put, just because something has been posted on the Internet does not mean that it will be there forever. To be sure, content posted to giant platforms such as Facebook, Twitter, or Reddit will likely last for quite a while. These corporations have the money and physical infrastructure to maintain this data. Especially since most platforms’ business models rely on gathering and processing data, it makes sense that they would be willing to pour millions of dollars into server infrastructure.
However, my fear is that this level of data retention is the exception rather than the norm. For smaller platforms, and especially for small personal websites, there are many factors that limit the ability to retain data forever. Maybe there just isn’t enough money to save everything. Even though data storage is incredibly cheap (especially thanks to cloud services such as Amazon Web Services, Microsoft Azure, or Google Cloud), data retention is nonetheless a cost that companies must consider. Oftentimes, data disappears from the Internet simply because it is forgotten about. Maybe a company closes down, or maybe somebody forgets to keep paying their web hosting bill–regardless of the reason, content is constantly disappearing from the Internet, often irretrievably.
For someone like me, this is especially concerning. As somebody who wants to research Internet communities–especially anonymous and pseudonymous ones–it is important that I have that original data to work from. Much like other areas of research, which are limited by what is available in library archives, the destruction and loss of primary sources is incredibly troubling. One of my most significant concerns is with one platform specifically–the anonymous imageboard website 4chan. Due to server limitations, 4chan explicitly limits each board to just 10 pages; content may only be available for a few hours to a few days, or even just a few minutes on high-traffic boards.
This problem was a huge driving force in my decision to write my Python 4chan scraper. Even without a single explicit research question in mind, I wanted to have a way to preserve the front page, thread content, and images from 4chan. However, keeping this software constantly running on my laptop wasn’t really an option. I considered using my home media server, but decided against it due to storage limitations. So when I kept seeing ads for a free trial of Google Cloud’s Compute Engine, I jumped on the opportunity. They were offering about $300 in free credit, and I figured that, since my CPU and storage needs were relatively low, it would be quite some time before I even approached that $300 ceiling. It was a convenient service that let me keep the 4chan scraper running on a regular basis.
This service ended up working very well for this purpose, and I began slowly accumulating a record of every /pol/ thread ever. I have been semi-closely following the Qanon conspiracy, wanting to see how it spreads from niche corners of the Internet into real actions in the physical world. I’m still not entirely sure what I want to do with this project specifically, but maintaining a record of /pol/ threads (/pol/ is where many of the Qanon “drops” have originated) will nevertheless be a useful resource for whatever I end up doing.
The importance of creating this archive was made even more apparent when Reddit recently made the decision to ban the Qanon-related community /r/GreatAwakening for inciting hate speech and violence. While I certainly applaud the decision to remove the community, the ban also makes it much more difficult to access and study it. Communities such as /r/GreatAwakening represent some of the negative aspects of online participatory culture, and the inability to examine these types of interactions may be a loss for online research more broadly. (Of course, there are many issues and problems with studying online communities, especially those that promote radical ideas.)
So for the last month or so, I took advantage of Google Cloud Compute Engine as a “set it and forget it” sort of deal for my 4chan /pol/ scraper. The problem is, though, I set it and then I forgot it.
I had started with a 10GB virtual disk, and quickly realized that this was nowhere near enough storage space. After about a week of scraping, I noticed a complete lack of CPU usage, disk reads/writes, and network activity. The script just wasn’t running at all anymore! I had a “d’oh!” moment and realized that saving all of the images from every thread was actually eating up a lot of disk space. I increased the size of the virtual disk to 100GB, and decided that I would need to figure out some other method of managing a data set that large. At the time, I left it as “I’ll have to deal with that problem later.”
Well, later arrived. After just under a month of scraping, I had used up an entire 100GB of space.
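In hindsight, a small guard in the scraper’s main loop would have caught this before the script silently died. Here’s a rough sketch of the kind of check I mean, using only Python’s standard library (the path and the 2GB threshold are just illustrative, not what my scraper actually does):

```python
import shutil

def has_free_space(path=".", min_free_gb=2):
    """Return True if the disk holding `path` has at least
    `min_free_gb` gigabytes free."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_gb * 1024**3

# In the scraper loop: warn (or skip image downloads) once the
# disk is nearly full, instead of failing silently.
if not has_free_space(".", min_free_gb=2):
    print("Low disk space – pausing image downloads")
```

A check like this could also be wired up to an email or chat notification, so a full disk becomes a ping rather than a week of lost data.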
As much as I would like to, it just isn’t feasible for me to scale up to save that much data. Yes, IaaS (Infrastructure as a Service) offerings such as Google Cloud are built to handle scaling use cases such as these, but I ultimately made the decision to limit my /pol/ scraper archive to that 100GB disk. Partially, it’s an issue of money; I don’t really want to pay Google for the storage space, especially since I’m not even fully certain of what I want to eventually do with the /pol/ data. More importantly, I realized that I would, in effect, just be creating and maintaining a 4chan mirror. And that’s just not a road I want to head down. So I decided to significantly prune the data and make some hard choices about what to keep and what to remove.
It helped being able to tell myself that I had already successfully written the script, and had set up a somewhat efficient workflow with Google Cloud, so I could always reconfigure and re-work it for use in later projects. But for now, I am only focusing on collecting 4chan threads related to the Qanon phenomenon.
Luckily, I wrote my script to generate text-based summaries of the /pol/ board. Each hour, I create a CSV file that records the threads on the front page. Then, for each of those threads, I create a CSV file that contains each post within the thread. Each of these files contains lots of useful data and metadata, such as date and time stamps, user IDs, and–most importantly in this case–thread and post titles. I was able to skim through a list of all the threads that I had gathered up to this point, looking for any mention of “Qanon,” “Q”, or the “Deep State.” In this process, I also made note of any other threads that seemed interesting and caught my attention. For instance, there were a few “Trump Appreciation” threads that I chose to preserve as well.
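The keyword skim itself is easy to sketch. Something along these lines would walk a folder of summary files and flag matching thread titles (the “title” column name is an assumption about my CSV layout; the word-boundary regex keeps a bare “Q” from matching every title that happens to contain the letter):

```python
import csv
import re
from pathlib import Path

# Word-boundary match so "q" only hits the standalone letter,
# not every title containing a q somewhere.
PATTERN = re.compile(r"\b(qanon|q|deep state)\b", re.IGNORECASE)

def matching_threads(summary_dir):
    """Yield (csv_path, title) for every row in the front-page
    summary CSVs whose thread title mentions a keyword."""
    for csv_path in sorted(Path(summary_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                title = row.get("title", "")
                if PATTERN.search(title):
                    yield csv_path, title
```

In practice I still eyeballed the full list as well, which is how the “Trump Appreciation” threads made the cut–keyword matching alone would have missed them.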
For those threads only, I made copies of the entire thread folder–the CSV file, as well as all images that were posted there. Rather than keeping these on the VM’s virtual disk, I copied them into a separate Google Cloud “storage bucket.” I don’t need immediate access to this data, so storing it separately from the active VM instance lets me cut down on costs somewhat. I also took this opportunity to back up all of the CSV summary files to the storage bucket, which centralizes all of the data in a single location and makes it possible to pull that data down to my own computer for analysis and writing.
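The copying step can be sketched roughly like this: gather the kept thread folders into a local staging directory, then push that directory to the bucket in one shot with something like `gsutil -m cp -r staging gs://my-bucket/pol-archive/` (the one-folder-per-thread-ID layout and the bucket name here are assumptions for illustration):

```python
import shutil
from pathlib import Path

def stage_threads(thread_ids, archive_root, staging_root):
    """Copy each kept thread's folder (its CSV plus any saved media)
    from the scraper's archive into a local staging directory.
    Threads whose folders no longer exist are silently skipped."""
    staging = Path(staging_root)
    staging.mkdir(parents=True, exist_ok=True)
    for thread_id in thread_ids:
        src = Path(archive_root) / str(thread_id)
        if src.is_dir():
            shutil.copytree(src, staging / src.name, dirs_exist_ok=True)
```

Staging locally first also gives you a chance to sanity-check what you’re about to upload before paying for any bucket storage.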
Next came the most stressful part of this whole operation–actually deleting the content that hadn’t made the cut. I recursively searched through the 4chan scraper’s /pol/ output folder to identify and remove every .jpg, .png, .gif, and .webm file.
This screengrab, which was taken just 12 hours after the massive purge, shows just how many files can pile up in a short period of time. The recursive search and delete let me maintain the thread hierarchy, and keep track of which CSV file belonged to which thread. All said and done, I deleted about 96 GB of saved image and video files. The VM virtual disk went from 100% usage to just under 4%, and a sizable portion of that 4% is likely the OS and other system files.
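For anyone attempting something similar, the purge boils down to a recursive walk that deletes only media files and leaves the CSVs and folder hierarchy alone. A sketch, with a dry-run mode I’d strongly recommend using first:

```python
from pathlib import Path

MEDIA_EXTS = {".jpg", ".png", ".gif", ".webm"}

def purge_media(root, dry_run=True):
    """Recursively delete saved image/video files under `root`,
    leaving per-thread CSV files and the folder hierarchy intact.
    Returns (file_count, bytes_freed) for what was (or would be)
    removed."""
    count, freed = 0, 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in MEDIA_EXTS:
            count += 1
            freed += path.stat().st_size
            if not dry_run:
                path.unlink()
    return count, freed
```

Running it once with `dry_run=True` and checking the reported count against expectations is cheap insurance before an irreversible 96GB delete.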
At the end of the day, I was still a little distraught over having to delete so much data. Especially because I’m not 100% certain what I want to do with the 4chan /pol/ archive, or how (if at all) I want to incorporate it into an eventual project on Qanon, it was incredibly difficult to do this. I really want to have as much material to work from as possible. However, I know that it is also critical to balance this desire with the reasonable confines of what is practical and feasible. The sad reality is that not everything on the Internet can be saved forever; a lot of content only lasts as long as it is practical for someone to maintain it. And the difficult truth to accept is that large swaths of information are probably deleted arbitrarily and on a whim all the time, much as I did during my own housekeeping. A large part of doing research on the Internet is accepting these realities, and confronting and working through the very real problem of data longevity and retention. Hopefully my deletion decisions won’t come back to haunt me, but in case they do, at least I have these notes to (perhaps) justify them.
Stay tuned for more Qanon, 4chan, and general political meme news.