You might put things online to make them permanent, but internet archives can disappear


What's at risk when sites disappear?

Here’s a list of things that have disappeared or changed on the internet in just the past few weeks: The social network Google Plus shut down, taking all its archives with it. That includes the profile pages of Google’s founders, which removes access to insights about the company’s history and decisions. Facebook said it “mistakenly deleted” posts by CEO Mark Zuckerberg, but it also changed how it archives corporate announcements and blog posts in a way that makes them harder to find. MySpace accidentally lost 12 years of posts from its users, including an estimated 50 million original songs. Host Molly Wood talked with Jason Scott, an archivist with the Internet Archive, the nonprofit digital library that managed to recover a fraction of those MySpace songs. The following is an edited transcript of their conversation.

Jason Scott: MySpace, which was one of the largest sites of the 2000s, always encouraged people to upload videos and music and photos. Then it turned out last year, during a migration, they lost everything that was older than about three or four years. There are estimates going around of something like 50 million songs lost. The Internet Archive was handed, from an academic group, a USB drive worth of songs, about 540,000 [MySpace] songs which had been part of an academic study on music networks back in 2010. That’s the only reason we have this music. It’s not anywhere near a full recovery. I wish it was.

Molly Wood: How common is it for these private companies like MySpace or Facebook or Google to either accidentally or intentionally lose information?

Scott: The accidental [loss] will always get a headline. Luckily that doesn’t happen as much as it might have years ago. But intentional, where they hit the end of a product’s life cycle … we’re in this situation right now where people have a lot of data that means something to them, and they’re getting very surprised a lot.

Wood: Right. This sort of raises a fundamental question: Are there any rules? Are there any laws or regulations about what companies have to hold on to and then any recourse if they don’t?

Scott: Generally, no. There are certainly a lot of laws around health information due to [the Health Insurance Portability and Accountability Act]. But we don’t do that for other data. I’m very much thinking of this as a tenant-landlord situation, that there are responsibilities that really should be in there if they’re housing your data because they are certainly benefiting from it.

Wood: Right. It sounds like you’re saying that preservation and archiving should be part of the larger data conversation that we’re having right now. Should that be part of the GDPR or any version of regulations like that that might come to the U.S.?

Scott: I don’t think it’s unrealistic to say that export — the ability to pull out data — the ability to be notified and have it held for a certain period of time after shutdown, that this should be part of the online experience.

Wood: Let me offer an alternate view. Something like 90% of all of the data in the world has been created in the last [few] years. Data centers are increasingly gigantic and consume tons and tons of energy and represent a real threat to the climate. Is there also a conversation that we need to be having about data expiration?

Scott: I would say that it’s an interesting argument to argue about the environmental [impact]. But you look at something like bitcoin, which nobody knew about years ago and now suddenly takes up such a percentage of energy, and I think we’re going to continue to see waves of that. It’s in companies’ interests to make data hosting, data retrieval and everything else be as inexpensive as possible. There’s money to be made. I could definitely see an argument that people generate too much data and that we end up with a lot of data. But I think that choice should be yours, not something you read about in a tweet the next day because something went “wrong.” [Companies] are efficiently or inefficiently losing the data of users who have no say and no knowledge of what they’re doing in terms of keeping access to their own data.

Wood: Simply put, for consumers out there, how reliable is the digital record?

Scott: The digital record is very reliable until it isn’t. It lets you have enormous amounts of reach, easy copying, easy access, easy sharing. But when things go wrong, they will go wrong utterly. You can recover a burned book. You can’t recover a literally dead disk that doesn’t work at all without spending an amount of money that nobody would spend. We have the best of times and the worst of times right now. I think that people should be aware that if something matters to them, that in some ways they need to be the caretaker.

Related links: more insight from Molly Wood

Jason Scott told us the Archive had been trying to preserve the public posts from Google Plus before it shut down, and that it’s proven to be over a petabyte of data, or a million gigabytes. Way back in 2009, Gizmodo created what I think is still the best-ever infographic that tries to show how much a petabyte is. The best part is how Gizmodo determined that it equaled 13.3 years of high definition television content, or 58,292 movies, and if each movie requires one large pizza (which it does), then a petabyte equals 52 tons of pizza.

There’s a good story from the BBC about how much from the early days of the web is now gone. That’s partly because the web was around for five years before it even occurred to anyone to start archiving it, and the first to do so was the Internet Archive. Other organizations are working on parallel efforts now. The British Library has a UK Web Archive. The National Library of Australia just put up an archive of that country’s websites.

But as more and more people put more and more data in just a few places, like Facebook, researchers are asking: If MySpace can lose 50 million songs, what could Facebook lose? There are concerns that we’re headed for a digital dark age where only a tiny portion of the digital record we’re creating right now will be preserved or even readable by future generations. Then again, I think all those people who’ve had their old tweets dug up and used against them, or the people posting so many ill-considered things online right this second, might argue that maybe not everything needs to be saved.
