What Internet Will Look Like in the Future
cripes does anybody remember Google People
– qntm, August 21st, 2019
Does anybody remember PhysicsForums?
It was never exactly the center of the Internet, but back when it was founded in 2001, the Internet didn’t really have a center the way it does today. PhysicsForums was one forum among thousands, founded by an enthusiastic teenager named Greg Bernhardt, existing in the ‘hard science’ niche alongside the likes of Bad Astronomy, mostly focused on giving hints for physics homework to struggling students without outright doing the physics homework. It had fairly steady growth until 2012, before petering out throughout the 2010s and 2020s in favor of more centralized sites like StackExchange, and by 2025, only a small community was left. But, unlike so many other fora from back in the early days, it went from 2003 to 2025 without ever changing its URLs, erasing its old posts, or going down altogether. Thanks to this consistency, PhysicsForums remains quite valuable as a time capsule, and can give us a glimpse at how people thought and what they said two decades ago.
For instance, we might go to this 2007 post asking about the capabilities of the now-obsolete HP 50g Graphing Calculator, and read a very helpful reply by commenter ravenprp:
Something has gone terribly wrong.
Quoth the Raven?
I reset my pw and logged back in to Google People for the first time in 10 (?) years just now and discovered the following:
- 10 years of updates on my account, written by me
– qntm, September 7th, 2019
At first glance, ravenprp is a very impressive user, writing 2,891 posts in a mere seven-month span (from September 2006 to April 2007) for an average of more than thirteen posts per day. And most of these weren’t casual one-line answers: reading his profile, one comes away realizing that he’s a true omnidisciplinary scientist, if not a modern-day Renaissance Man. He’s a member of a university faculty, a mechanical engineering student, and a hiring manager at a chemical plant (presumably in Canada); he’s an expert in working with .VOB and audio files and fasteners; he’s a structural engineer, biologist, medical physicist, chemist, and aerospace engineer. And to top it all off, he’s modest, and won’t claim expertise he doesn’t have: he’s an aerospace and aeronautical engineer, but not an expert in aerodynamics, nor a mechanical engineer, nor an expert in electrical models.
Impressively, these posts span from three years before the account was created to a year after the account was last logged into. And, as the icing on the cake, ravenprp is prescient enough that he can joke about being a language model developed by OpenAI, seven years before OpenAI was even founded; evidently he should have joined PsychicsForums instead.
A reasonable question, reading these posts, is: what’s the deal with the joined and login dates? If this is an entirely fabricated account, then why not just set the account creation date to be earlier than the first post, and the account access date to be later than the last post? The Wayback Machine gives us an answer: while the account currently lists 2,891 posts, an archive from 2019 lists 74. And looking at only these 74 gives a very different profile: ravenprp was an electrical engineering student asking for simple advice on MATLAB books, working through integrated circuit problems, giving other commenters useful links on the Bernoulli Equation, and asking for help from knowledgeable experts, such as computer professional ravenprp.
So it seems that this was a real account, once, and that ravenprp was a real person with real posts, opinions, questions, and answers. But any original thoughts and writings he might have had in 74 original posts have been all but drowned out by the 2,817 extra posts which have been added, backdated, and attributed to him.
Still, while finding those posts might be a challenge, at least those threads themselves are intact. For instance, he was helped in one thread by kyle8921:
And the answer would seem pretty reasonable at first glance, were it not for two small problems:
- Kyle8921, oddly enough, decided to finish an incomplete LaTeX expression from ravenprp’s post before beginning his own answer.
- Much like ravenprp, Kyle8921 posted the answer more than a year after he last logged in, and continued posting for more than three years after that.
Exactly how deep does this go?
The Internet is Forever
In regards to the dead internet hypothesis, the content that you’re enjoying today, will still be there tomorrow.
LoveMortuus, July 24th, 2024
Founder Greg Bernhardt started PhysicsForums as a simple GHB bulletin board, and he gave each thread and post a unique numerical ID; unlike many modern sites, this ID was sequential, with the first post given the ID of 1, the second post 2, and so forth. Some of these posts, of course, have since been deleted for one reason or another (including the first two years’ worth of discussion, which is why post #1 is in 2003 and not 2001), so we would not expect the line of posts to be continuous and unbroken…but we would expect the line to be monotonically increasing, so long as no posts had been retroactively added to the database. Plotting a sample of 30,000 posts grabbed at random, we can see how that pattern holds up:
For almost the entire history of PhysicsForums, that rule held true: if post A had a larger ID than post B, then post A had a later timestamp than post B. And anywhere that rule does not hold true, we can reasonably infer that the database has been altered and that a more recent post has been given an older date retroactively.
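The check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used for the plot: the function name `find_backdated`, the `(post_id, timestamp)` tuple layout, and the one-day `tolerance` are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_backdated(posts, tolerance=timedelta(days=1)):
    """Return IDs of posts dated earlier than a lower-ID post.

    `posts` is an iterable of (post_id, timestamp) pairs. The tolerance
    (a hypothetical one day here) ignores tiny clock discrepancies.
    """
    flagged = []
    latest_seen = None  # latest timestamp among unflagged lower-ID posts
    for post_id, posted_at in sorted(posts):
        if latest_seen is not None and posted_at < latest_seen - tolerance:
            # Higher ID but noticeably older date: likely added retroactively.
            flagged.append(post_id)
        else:
            latest_seen = posted_at if latest_seen is None else max(latest_seen, posted_at)
    return flagged


# Made-up sample data: post 4 has a higher ID than post 3 but an older date.
sample = [
    (1, datetime(2003, 1, 1)),
    (2, datetime(2003, 1, 2)),
    (3, datetime(2007, 5, 1)),
    (4, datetime(2005, 3, 1)),
    (5, datetime(2007, 5, 2)),
]
print(find_backdated(sample))
```

A running maximum like this is sensitive to a single post with a wrongly late timestamp, so a real analysis would want to inspect flagged posts rather than trust the output blindly.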
Database alteration and retroactive dating are not necessarily bad things. The biggest example came in 2022, when PhysicsForums merged with MathHelpBoards and incorporated its ~150,000 post archive, with each post given its original date. There are smaller examples within PhysicsForums history of a thread getting split (due to some off-topic but nevertheless useful discussion), or a post getting deleted but then restored with a newer ID, or a database hiccough causing re-assignment, or various other entirely justifiable small-scale edits. But on March 11th of 2023, a handful of posts were added all up and down the timeline, followed by a bigger group in May, an even bigger one in October, and the largest of all in January and February of 2024.
By taking the highly anomalous post IDs and filtering out the MathHelpBoards import, and filtering out instances which seem plausibly close to their actual post times, we get an estimated 115,000 posts written by LLMs and attributed to humans. Without scraping the entire website, it’s difficult to say precisely how many users have had their profiles revived, but taking a representative sample, there are at bare minimum 110 users affected. These range from some of the earliest commenters, to one-time posters, to long-time readers, to founder and site administrator Greg Bernhardt. In every case, a name is being attached to viewpoints that person does not necessarily endorse; certainly an average science enthusiast would probably not endorse the notion that 0.999… does not equal 1, if he knew about it:
But that’s the problem: they don’t know about it. We don’t know who most of these people are. We can’t get in touch with them. Many, if not most, have usernames either unique to PhysicsForums, or else so common elsewhere as to be useless for identification. They have moved on from PhysicsForums, living their lives, with no way of knowing what is or isn’t being said under their names, and no reason whatsoever to suspect anything is amiss. And maybe what’s said is accurate, or maybe it isn’t, but either way, with every additional post, the archives of the Internet are made just a little bit flatter, their own scientific contributions are diluted just a little bit more, and the portion of the Internet intentionally written by humans is shrunk just a little bit further.
The Dead Internet Theory
The internet feels empty and devoid of people…it’s like a hot air balloon with nothing inside.
Anonymage, “Dead Internet Theory (and much more)”, September 16th, 2019
The ‘Dead Internet Theory’ - the theory that much or most of the Internet consists of things not made by human beings - got its name in 2019, and has been slowly gaining popularity ever since. The theory as originally stated is a self-described ‘jumbled mess’, cites the ‘death’ of the Internet at about 2016 or 2017, and interestingly, predates almost every known LLM capable of writing convincing longform text; we tried a GPT-2 bot on /r/AskReddit two days after the original greentext, and the bot was noticed and called out within a handful of hours. And in 2016 or 2017, the state-of-the-art for text generation was Markov chains, and most bots online posted identical messages with no text generation at all.
The theory, and the general feeling that the Internet has changed shape, clearly had other root inspirations besides LLMs. Part of the feeling was probably due to the Internet opening up to a wider and wider global audience, with small community norms and standards simply not scaling up. Part of it was probably the device change from computer to smartphone encouraging a more passive role on the part of the audience. Part of it was probably algorithmic incentives towards more and more content engagement, as inevitable as Las Vegas’ extravagant casinos, emerging from pure market forces and not attributable to any one person in particular.
But part of it now, six years later, is certainly LLMs.
The thing is, it’s not like these non-human additions are adding value or usability in the original ravenprp thread we started with. You could argue that adding a summary is helpful… but that’s only the case if the summary is correct, and as we know, the AI generated summaries have a real tendency to enrich the facts with things that may or may not be accurate.
We can accept the addition of some links, etc, as a potential enrichment; though their intrusive nature is disruptive like any other advertisement, sometimes we accept disruption as part of what is needed to keep a website operating. The rest of the thread, though, is an LLM-generated post that contains exactly one true fact (the existence of the “HP 50g Connectivity Kit”), and otherwise contains nothing that is actually reflected in the HP 50g manual, or anywhere else one might find genuine information about the calculator.
Finally, we have the FAQ. This FAQ is the only place in the entire thread which references the “GET” and “PUT” functions of HP 50g Spreadsheet, and is the only place the summary could be getting them from. For that matter, the FAQ is the only place in the entire internet which references those functions; the manual doesn’t talk about them, or talk about the “HP 50g Spreadsheet” at all…and searching for “HP 50g Spreadsheet”, as of the time of writing, returns this thread and nothing else.
So while ravenprp’s post has a hint of correctness, the FAQ pulls only from the title and is entirely wrong, and the summary pulls primarily from the FAQ and is also entirely wrong. But the fact that these answers are incorrect is not actually the only problem. Instead of reading the replies, we can do a quick tally of the amount of space each section of this thread takes up:
| Section | Word Count | Human? |
|---|---|---|
| Summary | 99 | No |
| Springo | 38 | Yes |
| Phys.org | 34 | No |
| Shinny_head | 11 | Yes |
| Ravenprp | 164 | No |
| FAQ | 260 | No |
The LLM-written portions of this thread are 92% of the total word count. If this were a representative sample of the Internet, it would be reasonable to describe the Internet as ‘dead’.
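The 92% figure can be reproduced from the table if every row marked “No” in the Human? column is counted as machine-generated. The snippet below is just that arithmetic written out; the variable and function names are ours, chosen for illustration.

```python
# Word counts from the table above: (section, words, written_by_human)
sections = [
    ("Summary", 99, False),
    ("Springo", 38, True),
    ("Phys.org", 34, False),
    ("Shinny_head", 11, True),
    ("Ravenprp", 164, False),
    ("FAQ", 260, False),
]


def non_human_share(rows):
    """Fraction of all words that sit in rows not written by a human."""
    total = sum(words for _, words, _ in rows)
    non_human = sum(words for _, words, human in rows if not human)
    return non_human / total


print(f"{non_human_share(sections):.0%}")  # 557 of 606 words -> 92%
```

Only 49 of the 606 words in the thread come from Springo and Shinny_head, the two human participants.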
The situation across the entire PhysicsForums is not, fortunately, quite so dire yet. While the backdated LLM posts tend to be longer than average, they still make up only 1.6% of the total post count and 3% of the word count across all posts. The actual deception is, even now, still a fairly small portion of the total. But, when counting in the LLM-generated FAQs and summaries - which, as we’ve seen, aren’t always accurate - the Dead Internet Theory seems to be vindicated, and only 66% of the words on the site were written by human beings. Two years ago, that number was close to 100%. Two years from now, then…who knows?
It is true that the old archives aren’t read all that frequently, so it could be argued that this is no great loss. Only three archived PhysicsForums posts were shared on Twitter throughout all of 2024…and of those three, one was a human sharing a bot-written post, and another was a bot sharing a human-written post. So really, does anyone care?
Man vs Machine
“…using AI, one can generate tons of nonsense with very little work and overload us easily.” So far that hasn’t happened.
PeterDonis, November 12th, 2024
Certainly the community members of PhysicsForums care. In November of 2024, poster Renormalize noticed peculiar behavior by Azntoon: specifically, that Azntoon was posting in threads in 2022 despite having last logged on in 2012. The thread, however, soon came to the conclusion that it was some sort of database hiccough, and nothing more came of it. Interestingly, Azntoon’s last login date has since changed to December of 2022, making him one of the only two LLM-affected users to have (apparently) logged on again since the backdating began.
Users were on the lookout for it as early as 2022: the forum regulars debated what the policy regarding ChatGPT-generated responses should be, with Vanadium 50 complaining that the “barely-lucid posts” generated by ChatGPT create a lot of work and could be considered a DoS attack. The general consensus in the thread was summarized by PhysicsForums administrator Greg Bernhardt:
I’m pretty sure stackoverflow is attempting to ban it. I think we should discourage it, but I am unsure how to “ban” it here. Well worth a discussion. At a minimum content from ChatGPT should be quoted.
A few months later, the site celebrated April Fools’ Day of 2023 by temporarily changing all display names to ChatGPT, a few weeks after the LLM backdated insertion had already started, but since this was on April 1st, nobody assumed that posts were actually being written by ChatGPT. Alexander R. Klotz, one-time contributor to PhysicsForums’ Insights blog, noted on Twitter that his 2016 article had been edited without his knowledge, but the edits seem to be limited to an LLM-generated summary awkwardly placed in the middle, and a manually-inserted author’s note from 2018. The actual scope of the change, and the addition of more than a hundred thousand posts, is something we’ve seen no discussion of, anywhere, as of yet.
But the backdated users have all been hidden from the search’s autocomplete, and profile links have all been removed from their posts, presumably to make it more difficult to click the profile and notice that something is wrong.
Changes like this tend to be gradual: a slight drift over time, compromises made to keep the site afloat. When the community notices, it can all be explained, eased, and made bearable, even as everything changes inch by inch. If the community backlash is too great, it’s easy to backpedal and explain and have a lesser version of the change remain. Ultimately, the incentives towards making these changes will not go away. What decides how they work out is the judgement of the people making the changes and how the community responds to them.
Update: FAQs
During the writing of this post, the FAQs were actually noticed as well, and a discussion thread was created. It appears that they were not visible to logged-in users, explaining why the forum regulars did not notice: the average forum regular interested enough to go back and reread old posts will probably be logged in while doing so. The FAQs have now been removed, though the LLM-written summaries at the top of closed threads are still present.
The Internet is People
The dead internet theory is coming to fruition.
Greg Bernhardt, January 3rd, 2025
We reached out to Greg Bernhardt asking for comment on LLM usage in PhysicsForums, and he replied:
We have many AI tests in the works to add value to the community. I sent out a 2024 feedback form to members a few weeks ago and many members don’t want AI features. We’ll either work with members to dramatically improve them or end up removing them. We experimented with AI answers from test accounts last year but they were not meeting quality standards. If you find any test accounts we missed, please let me know. My documentation was not the best.
And in response to a follow-up question about the community feedback, he replied:
Mathwonk originally raised the issue of quality. I worked with him to improve them using a newer model, but it’s still not good enough.
The backdated answers were an internal test. We conceived of a bot that would provide a quality answer to a thread without a reply after 1+ years. That too also failed. Instead, I’m considering pruning all threads without a reply as they clutter up the forums.
It’s hard to imagine that 110 existing users gave consent to be used as test accounts, for 115,000 posts, over four waves spanning almost a year. The idea that these are test accounts gone wrong, or a bot accidentally mislabeled, doesn’t seem to align with the facts.
It also ties directly to an unstated but very real expectation: my identity in an online community is mine. I have accounts on forums that are older than some friendships. I have written tens of thousands of words under various identities. Just because they are relatively anonymous doesn’t make them less real to me, and it doesn’t diminish the effort and time I put into the work done under those identities. Hijacking accounts, filling them with content their original owners did not write? That dilutes their efforts, and muddles their identities. The content was produced with the idea that it was mine in some capacity, inextricably tied to an identity I owned.
There’s also a social contract: when we create an account in an online community, we do it with the expectation that people we are going to interact with are primarily people. Oh, there will be shills, and bots, and advertisers, but the agreement between the users and the community provider is that they are going to try to defend us from that, and that in exchange we will provide our engagement and content. This is why the recent experiments from Meta with AI generated users are both ridiculous and sickening. When you might be interacting with something masquerading as a human, providing, at best, tepid garbage, the value of human interaction via the internet is lost.
Beyond that, the idea of populating existing accounts with LLM-generated content is destructive. Like paving over an arboretum to make room for a generic strip mall. Internet archaeology is already a difficult and fraught business. It’s so difficult to find lost content, servers that have gone down, websites that are just gone… and now, apparently a lot of backdated data that is AI generated. This is not to say websites shouldn’t evolve and stay current, but this is different: this is a rewriting of history, for no clear gain.
It probably feels odd to see us write thousands of words fighting for the integrity of a community neither of us is part of, a tiny speck on the Internet trying desperately to survive, an enclave of a different era that is trying to hold on at all costs. But we are sympathetic. Running a website, especially a forum, is expensive. Server costs go up. Databases stop working and now you need to pay an expert or spend hours of unpaid time working on it. Bots flood in. DDOS attacks happen. Another wave of crypto-scams shows up. Staying alive on the internet costs money, and money comes through users and ads. You need those clicks like a man in the desert needs water, and every week it gets more competitive.
One must transform to survive. That axiom is a truth on the internet. If you don’t, you rapidly find yourself buried on the eighth page of Google results, with no users and no money to keep the servers up. But when communities compromise their morals and the core of their identity to stay afloat, and destroy the very bedrock of their commitment to their users and to some degree to the broader idea of the Internet, we have to wonder… was it worth it?