Publishing and communicating research in AI/ML is fundamentally broken
Why researchers should care, and four proposals for how to fix it
The premier venues for publishing and communicating research in AI and ML, the conference main tracks (NeurIPS, ICLR, ICML, etc.), are fundamentally broken.
Google Gemini helpfully informed me that: “NeurIPS 2025 conference received over 27,000 submissions, it's estimated that reviewing all submissions, even with a quick 30-minute review per paper, would require approximately 13,500 person-hours.”
NeurIPS official LLM guidelines for authors state: “You are welcome to use any tool, including LLMs, to prepare for your publications. However, you must describe the use of these tools clearly if they are part of your methodology.”
Whereas for reviewers the guidelines stipulate: “Do not talk about or share submissions with anyone or any LLMs.”
There is an iceberg in the water and we’ve already crashed directly into it. It’s never been easier to generate content (papers, experiments, code)1. Students, faculty, and many industry researchers are under immense pressure to submit as much work as possible to every conference to increase their odds of acceptance.
Some researchers have entirely abandoned traditional publishing venues in favor of “going direct” through preprint servers, technical reports, blog posts, and social media. Meanwhile, reviewers and conferences cannot use any of the available tools to fight back against the deluge. Should we even care?
Fixing these issues matters because the AI/ML research community (which includes both academia and industry) should be the vanguard for addressing the very real societal upheaval that our collective work is inciting. If we can’t clean up our own backyard - and fix the way that our own work is breaking how we evaluate and communicate that work - what does that say about our prospects for rolling out AI across sectors in ways that benefit humanity?
If you made it past the intro, I assume you’re either a fellow disgruntled member of the ML conference reviewer pool looking to commiserate (and possibly think about solutions), a program chair who is only 99% convinced that the system is broken beyond repair and thinks there may still be hope, or an AI enthusiast with a morbid curiosity about the inner workings of the field.
It’s LLMs all the way down
From the NeurIPS 2025 program chairs:
The maximum number of papers per reviewer can be 6…We will be strictly imposing Responsible Reviewing this year…If your review is identified as insufficient, you will be prompted by your AC to improve and revise your review. You are obliged to do so and communicate your efforts with your AC promptly in order not to find yourself on a blacklist…Grossly irresponsible reviewers may have all their own submissions desk-rejected from NeurIPS’25 as an ultimate penalty.
In other words - review 54 total pages of highly technical content (six papers at the nine-page limit), while also handling your own paper submissions, and convince us you spent a long time doing so, regardless of the quality of the papers you’re reviewing, under penalty of blacklisting.
Why should the reviewer pool spend time reviewing and writing thoughtful essays about manuscripts that are not ready for review, and/or generated mostly by LLMs? Honestly, if I thought my reviews were going to be turned into human feedback data to make better paper-writing and reviewing LLMs, I’d at least understand the value (more on that later).
And because reviewers face increasing demands on their reviewing time, without being able to take (sanctioned) advantage of the productivity gains available to authors, time spent producing detailed reviews of low-quality or incomplete submissions necessarily detracts from reviewing the papers that do warrant careful examination.
It is naive to think that reviewers are not violating these guidelines left and right, outsourcing their intractable reviewing burden to LLMs2. Moreover, it’s rare to have a submitted paper that isn’t already on arXiv and perfectly viewable by LLMs.
Is the system worth saving?
Always a good question to ask, especially in this case, when things are changing rapidly and the value and relevance of “traditional” scientific publishing are constantly being questioned. I wrote about why I think writing still matters for researchers in a recent post.
ML conferences and journals provide a curation service above all else, by polling mostly random samples of researchers and asking “do you think other people in your field should pay attention to this?” Venues that are more successful at this have higher prestige. It’s been a difficult model to disrupt, with distill.pub being a noble, but ultimately doomed, example.
Most attempts to replace traditional publishing have come from well-meaning academics who fundamentally misunderstand the competition (legacy journals and archival conferences), thinking that new approaches merely need to offer better ways of reviewing and sharing high-quality research. But traditional venues are brands that allocate clout via artificial scarcity (e.g., maintaining 20-25% acceptance rates), rather than places where researchers first learn about new results.
Of course, it could be that the main conferences and journals implode, and new researchers entering the field will fight it out online against established researchers and “AI scientists.” Peer review will be LLMs talking to other LLMs, with occasional comments from humans mixed in. As AI eats other fields, the same will eventually happen to physics, chemistry, biology, and everything else. To some, I’m describing a natural and unavoidable evolution of the scientific process; to others, a dystopia.
How to fix it
If we accept that ML conferences and scientific journals occupy an evolutionary niche, and that something that looks like them will continue to exist and be important for how we establish scientific reputation, especially for early-career researchers and academics, what do we do?
Proposal 1: Fix the incentives around JMLR and TMLR. JMLR and TMLR arguably solve most of these problems already. TMLR “emphasizes technical correctness over subjective significance, to ensure that we facilitate scientific discourse on topics that may not yet be accepted in mainstream venues but may be important in the future” and “employ[s] a rolling submission process, shortened review period, flexible timelines, and variable manuscript length, to enable deep and sustained interactions among authors, reviewers, editors and readers. This leads to a high level of quality and rigor for every published article.” Unfortunately, as Neel Nanda pointed out on Twitter, there is a perceived “prestige gap” within the researcher community between the main conferences and TMLR3. Whether you personally agree with that assessment is moot. Early-career researchers, who write the bulk of submitted papers, do, and they are incentivized to shoot for conference papers above all else. Prestige is a collective belief, and we could fix incentives by having a consortium of well-known, well-respected academic and industry researchers agree to send their best work to TMLR.
Proposal 2: Adopt best practices from the scientific journal publishing system. A shocking statement that presupposes there are best practices to borrow from scientific journals (e.g., Cell, Nature, Science). I believe there are at least some things worth copying. Namely, human-driven desk rejection. ML conferences have a brutal automatic desk-reject system where submissions are not considered if they violate some easily detectable formatting rules, page limits, or author registration requirements4. Even better would be if ML venues implemented something similar to what journals do, where an editor asks “does this work belong in this venue? are there any obvious reasons why it is unlikely to be worth sending out for review?” Area chairs (or even LLMs with specific prompts!) could implement some basic filters to flag submissions that are unlikely to warrant detailed reviewer feedback at this time, as sketched below. This allays concerns about papers being auto-rejected based on a single subjective human (or LLM) opinion, while giving reviewers the chance to contradict that assessment if they feel strongly, without penalizing them if they agree and don’t write an extensive review.
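As a rough illustration (and emphatically not an existing OpenReview or NeurIPS feature), here is a minimal sketch of what such a pre-review triage filter might look like. It assumes the OpenAI Python client; the prompt, model name, and JSON output format are my own placeholders, and any real deployment would need the data-handling guarantees discussed in the next proposal.

```python
# Hypothetical pre-review triage sketch: flag submissions that may not warrant
# full review, without rejecting anything outright. Assumes the OpenAI Python
# client (any frontier model API would work); prompt, model name, and output
# schema are illustrative placeholders, not an existing OpenReview feature.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TRIAGE_PROMPT = """You are assisting an Area Chair with pre-review screening.
Given the title and abstract below, answer:
1. Is this work plausibly in scope for this venue?
2. Are there obvious reasons it is unlikely to be ready for full review
   (e.g., no experiments described, placeholder sections)?
Respond as JSON with keys "in_scope", "ready_for_review", and "reason".

Title: {title}

Abstract: {abstract}
"""


def triage(title: str, abstract: str) -> str:
    """Return the model's raw JSON triage assessment for one submission."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model would do
        messages=[
            {"role": "user", "content": TRIAGE_PROMPT.format(title=title, abstract=abstract)}
        ],
    )
    return response.choices[0].message.content


# The result would surface to the AC and reviewers as a hint, not a decision:
# reviewers remain free to contradict it, as discussed above.
```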
Proposal 3: Fix the reviewer productivity problem by giving responsible access to better tools. This should be a no-brainer opportunity for one or more of the frontier model development teams to strike an agreement with the major ML conferences. Integrate frontier models (Qwen, Gemini, ChatGPT, Claude, Llama, Mistral, etc.) directly into OpenReview, with a clear agreement about not training on unpublished work, auditing, and highlighting the parts of a review that are LLM-generated. This is a great branding opportunity for a frontier model developer to fix a real problem in the community in a very visible way, at substantially lower cost than buying and staffing a booth in the conference expo center. Submitted papers could also come with a score from a service like Pangram to inform reviewers and Area Chairs which parts of a paper, and how much of it, are LLM-generated.
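To make this concrete, here is one hypothetical shape such an integration could give to a review record. The field names and structure are invented for illustration; the paper-level detection score would be supplied by an external detector rather than computed by OpenReview itself.

```python
# Hypothetical shape of an "LLM-transparent" review record under the kind of
# OpenReview integration described above (Python 3.10+). Field names are
# invented for illustration; the paper-level detection score would come from
# an external service (something like Pangram), not be computed here.
from dataclasses import dataclass, field


@dataclass
class ReviewSegment:
    text: str
    llm_generated: bool  # declared by the reviewer or highlighted by tooling


@dataclass
class TransparentReview:
    submission_id: str
    segments: list[ReviewSegment] = field(default_factory=list)
    paper_llm_score: float | None = None  # e.g., fraction of the paper flagged as LLM-written

    def llm_fraction(self) -> float:
        """Fraction of the review text (by characters) marked as LLM-generated."""
        total = sum(len(s.text) for s in self.segments)
        flagged = sum(len(s.text) for s in self.segments if s.llm_generated)
        return flagged / total if total else 0.0
```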
Proposal 4: Monetize reviews as training data for reward models. I saved the most controversial proposal for last. The reviewer community is a disorganized collective that effectively provides free annotations (like a non-profit version of ScaleAI) in exchange for: a) a sense of moral satisfaction at participating in what some believe is a key part of the scientific process, b) an honest desire to read good research and sharpen thinking by critiquing it, or most likely c) fulfilling some obligation necessary to get their own papers reviewed. ICLR is the only conference where reviews are made fully public on OpenReview by default, so we’re already feeding LLMs with reviewer feedback in the form of the written review, numeric scores, and binary rewards (accept or reject). Some venues, like the Learning on Graphs conference, compensate reviewers for high-quality reviews. In tandem with proposal #3 above, a frontier model company could pay to have exclusive access to reviews as training data, in exchange for subsidizing the conference registration or travel costs of reviewers (i.e., paying them for work, but via discounts or vouchers, which people feel less weird about). For any company that is serious about building “AI scientists”, it’s hard to imagine a more valuable, scalable source of training data than expert annotations of new research papers.
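For concreteness, here is a minimal sketch of how a public (paper, review, decision) triple might be packaged as a reward-model training example; the record layout is hypothetical and not tied to any particular company's pipeline.

```python
# Sketch of packaging public (paper, review, decision) triples as reward-model
# training data, using the signals mentioned above: review text, numeric score,
# and the binary accept/reject outcome. The record layout is hypothetical.
import json


def to_reward_example(paper_text: str, review_text: str, score: int, accepted: bool) -> str:
    """Bundle one (paper, review) pair with its reward signals as a JSON line."""
    record = {
        "input": paper_text,             # the submission being judged
        "critique": review_text,         # the expert annotation of that submission
        "scalar_reward": score,          # e.g., a reviewer rating on a 1-10 scale
        "binary_reward": int(accepted),  # final accept/reject decision
    }
    return json.dumps(record)


# One JSONL line per (paper, review) pair; a reward model could then be trained
# to predict scalar_reward or binary_reward from the paper and critique.
```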
It’s clear that LLMs are already an intermediary for most researchers in understanding the world and producing new understanding. Many research groups are actively pursuing mostly or completely AI-driven research and writing. There is a future in which ML conferences, and other traditional modes of scientific communication, continue to atrophy and devolve into irrelevance and meaningless status games. There’s an alternate future where we embrace the change, adapt to it, and build or rebuild our systems and institutions to serve an actually useful purpose: to advance and spread knowledge.
Getting in touch
If you liked this post or have any questions, feel free to reach out over email or connect with me on LinkedIn and X / Twitter.
Acknowledgements
Thanks to Kyunghyun Cho for reading early drafts of this post and providing helpful feedback.
1. The entire problem is another instance of the generation-verification gap, because it’s significantly easier to generate a paper (or something that looks like one) or a research result than it is to say whether a paper is truly original, significant, clear, and high quality (the scoring rubric for NeurIPS).
2. I don’t do this, because authors can already get LLM feedback and don’t need me to do it for them. Instead I have to write Substack posts like this one, where I shake my fist at the sky.
3. It should be noted that JMLR, currently led by editors-in-chief Francis Bach (Inria) and David Blei (Columbia), is considered very prestigious by serious ML researchers.
4. This is fine, although giving authors a chance to correct these errors and resubmit would be a welcome improvement.
On the problem of “too many to review”:
What about a fee for submission (e.g., $100) that is waived upon acceptance?
It can be lowered substantially for accessibility reasons (e.g., $25 per paper), and the proceeds can go towards improving accessibility in the field (e.g., by hosting the Deep Learning Indaba). (Of course we know that reviews are pretty random, but still.)
You do mention that people are under pressure to submit work to conferences, and this may discourage massive submission volumes per lab.