
What began with a quiet release from GPTZero turned into a thunderclap across academic corridors. Its audit, aptly named “Hallucination Check,” identified 51 accepted NeurIPS papers containing more than 100 citations that simply did not exist. Not misquoted. Not outdated. Invented entirely.
For a moment, the silence said everything. NeurIPS, long considered the epicenter of artificial intelligence discoveries, was suddenly staring into a mirror held up by the very tools it helped inspire. And the reflection was badly warped.
| Item | Details |
|---|---|
| Event | Hallucinated citations in 2025 NeurIPS conference |
| Papers Affected | 51 accepted papers with over 100 fake citations |
| Detection Method | GPTZero’s “Hallucination Check” tool |
| Conference Status | Prestigious annual AI and ML research conference |
| AI Role in Issue | LLMs used for writing and referencing, led to fabricated citations |
| Reviewer Challenge | Volume overload and lack of manual fact-checking |
| Public Reaction | Concern over academic standards and review integrity |
| Credible Source | TechCrunch, Jan 2026: https://techcrunch.com/2026/01/21/irony-alert-hallucinated-citations-neurips/ |
These weren’t careless mistakes or innocent omissions. They were citations so convincingly formatted, down to author initials and journal style, that even seasoned reviewers, pressed by deadlines and deluged with submissions, missed them. The structure was impeccably academic; the content was void.
By leaning on GPT-like language models, authors inadvertently, or perhaps conveniently, allowed AI to generate references with remarkable fluency but no basis in reality. And amid the rapid fire of conference deadlines, placeholder citations like “[Doe, 2022]” became permanent fixtures. No follow-up. No verification.
NeurIPS has evolved over the last decade from a close-knit community of neural network aficionados into a vast, competitive arena. In 2025 alone, the conference received nearly 21,000 submissions. With that scale comes automation: of sorting, of evaluating, and, as it turns out, of referencing.
This wasn’t entirely unforeseen. Researchers have long warned that LLMs can deliver confident but incorrect outputs. What is startlingly consistent across the flagged papers is how each false citation replicated the cadence of a real one. One cited “Advances in Multi-Agent Coordination, Wang et al.,” a publication that sounds credible but has never existed.
Once, while going over an AI-generated manuscript, I paused at a reference attributed to me for a paper I had never written. That odd blend of flattery and falsehood is a hallmark of LLM hallucination, and for researchers racing to meet deadlines, it is easy to overlook.
By pairing citation managers and autocomplete tools with their writing, many academic authors have shortened their process. But that streamlining, however efficient, has quietly eroded the thoroughness traditionally demanded of a literature review.
For medium-sized research teams with limited resources, LLMs offered efficiency. That efficiency has now shown its sharp edge. Reviewers at NeurIPS, understandably overburdened, weren’t equipped with hallucination detectors. Their priority was the substance of the work, not the reference metadata.
By integrating detection tools like GPTZero, institutions are now working to contain the damage. But questions linger: How did these fabrications survive peer review? Why didn’t authors double-check their references? And most importantly, what does this mean for the legitimacy of AI research?
Increasingly sophisticated tools, from models like Claude to services such as Humanizer, promise to erase detectable AI traces from writing. The result is an uncomfortable arms race between AI-generated language and AI-driven detection, a dynamic as unsustainable as it is comical.
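To make the idea of an automated citation check concrete, here is a minimal sketch, which assumes nothing about GPTZero’s internals: each reference title is queried against the public Crossref API, and anything without a sufficiently similar match is flagged for human review. The example bibliography, the similarity threshold, and the helper name are illustrative assumptions, not part of any conference’s actual pipeline.

```python
# Minimal sketch of a citation-existence check (illustrative, not GPTZero's method).
# Each reference title is looked up on Crossref; unmatched titles get flagged.
import difflib
import requests

CROSSREF = "https://api.crossref.org/works"

def title_resolves(title: str, threshold: float = 0.85) -> bool:
    """Return True if Crossref knows a work whose title closely matches `title`."""
    resp = requests.get(
        CROSSREF,
        params={"query.bibliographic": title, "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for candidate in item.get("title", []):
            ratio = difflib.SequenceMatcher(None, title.lower(), candidate.lower()).ratio()
            if ratio >= threshold:
                return True
    return False

if __name__ == "__main__":
    # Hypothetical bibliography: one real title, one in the style of the flagged fakes.
    references = [
        "Attention Is All You Need",
        "Advances in Multi-Agent Coordination",  # plausible-sounding but unverified
    ]
    for ref in references:
        status = "found" if title_resolves(ref) else "NOT FOUND - needs manual review"
        print(f"{ref}: {status}")
```

A real screening pass would also compare author lists, venues, and years, and would route flagged entries to a human rather than reject a paper outright, since title matching alone produces false positives.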
During the 2025 cycle, one contributor, Kevin Zhu, submitted over 100 papers to various AI conferences, many including high school co-authors through his company Algoverse. While his NeurIPS submissions were primarily workshop-level, the sheer volume underlines how publish-or-perish pressure has combined with scalable AI tooling.
In the context of academic publishing, citation integrity isn’t optional. It is fundamental. Fabricated sources not only confuse readers but risk spreading disinformation, especially when subsequent scholars unintentionally build upon faked foundations.
Over the past few months, comparable flaws have surfaced at ICLR and ICML, showing that this is not an isolated NeurIPS incident but a broader systemic failure. Several papers contained hallucinated citations. Some reviews were themselves produced by AI. One reviewer submitted 96 reviews, possibly generated with an AI.
Nevertheless, conference organizers have responded with cautious optimism. They acknowledge the problem but maintain that the underlying research remains sound. They’re not wrong, but they’re not entirely right either. Trust, once damaged, rarely returns in full.
The STM Report projects that 5.7 million academic papers were published in 2024—a 46% increase from 2019. Much of this rise is credited to generative AI. But more volume has not translated into higher standards. Instead, the academic world faces a paradox: more articles, but fewer that are extensively read or rigorously vetted.
I recently spoke with a graduate student who confessed to citing two papers she hadn’t read. It didn’t make her proud. But she stated that everyone she knew did the same. “It’s about formatting now,” she remarked. “Not facts.”
Through that lens, the NeurIPS episode looks less like an exception and more like a symptom. The combination of AI’s linguistic polish and academia’s preoccupation with speed has created a publication climate that rewards volume over verification.
By introducing stronger citation checks, conferences might regain some control. Cultural change is harder, though. The urge to offload tedious tasks to AI will endure, especially when AI performs them faster, cheaper, and more convincingly than people do.
Convenience comes at a price. And while NeurIPS 2025 will no doubt recover from this disgrace, the academic community now faces a hard question: if even its finest minds can’t tell fact from invention, what safeguards remain?
This humbling moment calls for reflection. Not a rejection of AI, but a recalibration of how we use it. A chance to ask not only what is feasible, but what is permissible. A pause in the headlong sprint to determine where we are actually going.
