# Has NLP Become Abstract Math?

EMNLP 2025 was recently held in Suzhou, featuring 2455 main conference papers ([source](https://x.com/emnlpmeeting/status/1854951221309980681)). My X/Twitter and LinkedIn timelines over the past few weeks have been full of posts announcing NLP papers published at this conference, ranging across various NLP-related topics. A lot of these papers were indeed exceptional; skimming through the proceedings, I saw many papers showing increases in model accuracy and evaluating language models in different scenarios.

However, things became a bit blurrier in my mind when I started to ask, "how can these papers change my life?" A lot of papers claim accuracy increases, but they mostly conduct experiments on small models, using GPT or other SotA models as a "non-achievable" target, only showing improvements in those small or task-specific models. Others merely evaluate SotA models on certain tasks, with no actionable recommendations on how an end user would use GenAI models differently. Some papers *benchmark models* and see how LLM X can perform action Y in domain Z, but keep it as an exploration rather than pursuing any real-world impact. Several papers provide datasets, particularly synthetic data generated using simulated *parrots* (read: agents), which have no usefulness beyond training similar models, which in turn leads to one of the paper types already mentioned above. This has been exacerbated by the pressure to publish in the field of NLP, and ML/AI in general, pushing researchers to only publish MVPs (Minimum Viable *Papers*) and miss a deeper connection with what is needed in the current world.

Now that almost every tech-literate person in the world has used, or at least heard of, ChatGPT, this claim can be easily verified. How many of the papers in EMNLP this year, or even in similar conferences, can actually lead to improvements in how real humans use LLMs? GPT and other SotA models nowadays are *good enough* relative to the models released through NLP publications; this means that 1) most of the contributions in these papers would also be possible with simple prompting of SotA models, possibly combined with external tools, and 2) the abilities still *impossible* for SotA models have also remained unaddressed in recent NLP publications.

We also see a trend among researchers in adjacent fields, e.g., those who used to work at the intersection of HCI and NLP, who merely prompt an SotA standard or reasoning model rather than fine-tuning their own specific models (effectively turning from HCI+NLP researchers into only-HCI researchers). This trend is not surprising: relying on prompting not only makes their research progress faster, but also leads to equal or better outcomes; the researchers would likely not achieve a significantly higher accuracy score by spending time, budget, and effort on training task-specific models, compared to simply prompting an off-the-shelf LLM at minimal cost. And, before you ask, yes, this description also fits *me*: I chose to ditch the task-specific BERT models I had embedded in my EdTech tool and went with prompting GPT-5 with reasoning for classification instead.
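To make the swap concrete, here is a minimal sketch of what "prompting instead of fine-tuning" looks like for a classification step, assuming the official OpenAI Python client; the label set, prompt wording, and `classify` helper are hypothetical illustrations, not the actual code in my tool, and the reasoning-effort knobs are left at their defaults.

```python
# Minimal sketch: replacing a fine-tuned BERT classifier with a prompted LLM.
# Assumes the OpenAI Python client (`pip install openai`) and OPENAI_API_KEY
# set in the environment; the label set and prompt wording are hypothetical.
from openai import OpenAI

LABELS = ["conceptual_question", "procedural_question", "off_topic"]  # hypothetical labels

client = OpenAI()

def classify(text: str) -> str:
    """Ask the model to pick exactly one label for a student message."""
    prompt = (
        "Classify the following student message into exactly one of these labels: "
        + ", ".join(LABELS)
        + ". Reply with the label only.\n\nMessage: "
        + text
    )
    response = client.chat.completions.create(
        model="gpt-5",  # model name from the post; swap for whatever you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content.strip()
    # Fall back to a default label if the model replies with something unexpected.
    return answer if answer in LABELS else "off_topic"

if __name__ == "__main__":
    print(classify("How do I compute the derivative of x^2?"))
```

The point is not the specific API call: it is that a handful of lines like these replace an entire data-collection and fine-tuning pipeline, often at comparable or better accuracy, which is exactly why so many of us have made the switch.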
I think there is a simple reason for this: leading GenAI companies (e.g., OpenAI, Anthropic, and Google, to name a few) are spending tons and tons of compute, which is simply not available to other research labs. They are focused on providing the final, versatile product (particularly in the form of their chatbots and APIs), and thus naturally direct all their resources towards that goal. The outcome is a set of SotA models, versatile and general-purpose enough that most attempts by research labs to score higher than them using smaller and more manageable models fail, and any benchmark/evaluation paper also loses relevance within a few years, or even months, as new generations of SotA models arrive at a very fast pace.

But does this mean that we should all abandon NLP research? Of course not (count me in the group of NLP fans!). I particularly think NLP is now undergoing a transformation not dissimilar to what I think happened in mathematics ages ago. Back in the day, e.g., before the 19th century, a lot of mathematics research seemed to follow real-world problems. A specific example is the invention of algebra by [Al-Khwarizmi](https://en.wikipedia.org/wiki/Al-Khwarizmi#Algebra), motivated by the real-world problems of trade, inheritance, and surveying. This was expected; the way humans lived kept evolving, necessitating new concepts to come along. However, current mathematics research is far from that; it heavily focuses on more abstract concepts, far from having *immediate* real-world effects. Before you stop me, this is not a problem *at all*! Many around the world enjoy the current state of mathematics research, even though it does not have *immediate* effects. A research field does *not* need to have *immediate* effects on how humans live in order to be valuable and worth exploring; we, humans, can enjoy abstract research as well.

NLP research used to have a more *immediate* real-world effect, and now this has been reduced, just like what happened in mathematics. However, it is still valuable, and maybe even more *fun*, compared to several years ago when NLP research *had* to have some real-world effect in order to be taken seriously. (Note that I'm ignoring the other differences between the NLP and math domains, e.g., the type of reasoning involved in conducting research, for the sake of simplicity in the argument.) It should also be noted that NLP has still *not totally* become like abstract math; requirements such as privacy, inference at scale, and more still necessitate NLP research with immediate real-world outcomes. However, the idea that "GPT and other SotA models will make NLP research obsolete," which is a common fear among PhD students in NLP-related domains, is only true if one equates NLP research with "NLP research with immediate real-world outcomes." It does not need to be! Mathematics research is so much fun for math nerds; let it be fun. Similarly, NLP research is so much fun for NLP nerds; let it be fun as well!

One more note (and complaint) before concluding: due to the academic system and the publish-or-perish mentality, we are seeing a lot of *benchmarking* NLP papers that follow a simple tried-and-tested formula of A) picking a certain domain, B) generating or finding dataset(s) in that domain, C) evaluating several LLMs on that domain, and D) reporting results with accuracy metrics and fancy plots. I think a lot of these papers might be harmful for the NLP community, for several reasons. First, they add nothing of use to the NLP landscape, increasing the amount of *noise* in conference proceedings.
Second, they spend a lot of compute for no useful outcome, which is a highly suboptimal way of spending research budgets (possibly funded by taxpayer money) and also has unnecessary side effects (e.g., environmental impact). Third, the domains used in these papers are often highly niche, with low relevance to most real-world use cases. And, lastly, the results obtained in these papers are mostly highly context-dependent; changes in the base LLMs used, in the type of prompting, or, in some cases, even re-running the model with a different seed can drastically alter the results, rendering the published numbers useless.

What I would instead prefer is that we stop publishing papers for the sake of publishing papers. I want to see less noise added to the NLP conference proceedings over time, and instead see works that are, as a minimum requirement, *fun*. What I would want to see is, for example:

- a paper trying to speed up or improve the decoding process by playing with how LLMs are designed;
- a paper that sees how architectural or prompting changes in LLMs can impact instruction-following in a positive way;
- a paper that analyzes real-world risks of deploying LLMs in practice (e.g., bias in models);
- a paper that analyzes how LLMs, in the bigger picture, can imitate human behaviors and act as truthful simulation machines, without blindly assuming so;
- a paper that is not rendered useless by simple changes in parameters or the base LLM;
- a paper that tells me "wow, I can use the same approach to improve my pipeline" after reading the abstract;
- and a paper that contributes something useful to the community, so that I can cite it not by saying "X et al. found LLMs are not good at Y" but by saying "X et al. found that method Z improves how LLMs work at Y."

NLP might become similar to abstract math soon. Let's embrace it and conduct *fun* research that we can be proud of in the future, rather than trying to somehow claim our goal is to come up with approaches useful in real-life scenarios, while our real intention is actually bumping up our h-index and publication count.