My two cents on LLMs for impossible languages
LLMs are mysterious creatures, and science is all over them. Noam Chomsky goes against the tide and finds them rather unimpressive. He argues that they cannot distinguish the possible from the impossible, and that they can learn natural and impossible languages with "equal facility". Interestingly, an ACL 2024 best paper argues that Chomsky is wrong on this matter, bringing forward empirical evidence.
How cool is it to refute a world-famous linguist with empirical evidence? Intrigued by this "battle of the giants" (ACL best paper vs. linguistic eminence), I looked more closely at the paper. In the end, however, I found myself siding with Chomsky.
First, let's examine the paper's main evidence, which seems to be this:
We find that models trained on possible languages learn more efficiently, evident from lower perplexities achieved in fewer training steps.
If you're not sure what is meant by possible/impossible languages, simply take English as a possible language, and English with random word order as an impossible one. Informally, perplexity measures the ability to predict the next symbol in a sequence. It expresses uncertainty, so the lower the better.
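To make the metric itself concrete, here is a minimal sketch (my own toy numbers, not the paper's code): perplexity is just the exponentiated average negative log-probability that the model assigns to each observed next token.

```python
import math

def perplexity(next_token_probs):
    """Exponentiated average negative log-likelihood of the observed tokens;
    lower means the model was less surprised by the sequence."""
    avg_nll = sum(-math.log(p) for p in next_token_probs) / len(next_token_probs)
    return math.exp(avg_nll)

# A model that assigns fairly high probability to each observed next token ...
print(perplexity([0.5, 0.6, 0.4, 0.7]))     # ~1.9
# ... versus one that is roughly guessing among ten candidates at every step.
print(perplexity([0.10, 0.12, 0.09, 0.10])) # ~9.8
```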
Let's also assume, for the sake of the paper's argument, that learning efficiency is equivalent to Chomsky's notion of "facility to learn."
In the nice graphs shown in the paper, we clearly see faster convergence and better perplexity on the possible languages. This gives the appearance that learning impossible languages is "natively" harder for LLMs.
So can we call it a day and declare Chomsky disproven? Well, there appears to be a bug in this way of reasoning.
Let's first note that the paper's training and testing data varies across setups. Therefore, achieving lower perplexity doesn't necessarily equate to having learnt more efficiently, even if it is also achieved in fewer steps. Similarly, the ability to efficiently learn any kind of pattern doesn't guarantee equally low perplexity on any kind of data.
To make this concrete, just consider the example of shuffled words. With random word order, we would expect the best achievable perplexity to be naturally worse than if we hadn't randomized: the next word in a randomly ordered sequence is harder to predict than the next word in a standard English sentence. English also contains regularities like SVO (Subject-Verb-Object) order, which are broken by shuffling. Therefore, a model's uncertainty when predicting sequences without any order can only be higher.
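Here is a back-of-the-envelope illustration of that floor effect, on a hypothetical toy corpus of my own (nothing from the paper): with word order intact, a predictor can exploit the SVO-like regularity, but once the words are shuffled the best it can fall back on is overall word frequencies.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus with a rigid SVO-like pattern (illustration only).
corpus = ("the cat sees the dog . the dog sees the bird . the bird sees the cat . "
          "the cat likes the dog . the dog likes the bird . the bird likes the cat .").split()

def unigram_entropy(tokens):
    # With word order destroyed, the best remaining predictor is the unigram
    # distribution, so exp(unigram entropy) is the perplexity floor for shuffled text.
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in Counter(tokens).values())

def bigram_conditional_entropy(tokens):
    # H(next | previous): a rough perplexity floor (after exp) for the ordered text,
    # since a predictor can exploit the word-order regularities.
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[prev][nxt] += 1
    n_pairs = len(tokens) - 1
    h = 0.0
    for nexts in pair_counts.values():
        total = sum(nexts.values())
        h += sum(-(c / n_pairs) * math.log(c / total) for c in nexts.values())
    return h

print("perplexity floor, ordered  :", round(math.exp(bigram_conditional_entropy(corpus)), 2))  # ~2.1
print("perplexity floor, shuffled :", round(math.exp(unigram_entropy(corpus)), 2))             # ~6.1
```

On this toy corpus the ordered floor comes out around 2, against roughly 6 for the shuffled variant, simply because shuffling throws away the predictive structure while leaving the vocabulary untouched.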
So the best perplexity level that a model can be expected to achieve differs across setups, even if the vocabulary of tokens is the same. Without knowing these expected levels, we can't meaningfully compare the results, and so it's hard to judge whether Chomsky is wrong (or right).
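The same point can be phrased as a standard information-theoretic bound (textbook material, not something taken from the paper): for test data with distribution p, no model q can push the cross-entropy below the entropy of p, so the attainable perplexity floor depends on p itself.

```latex
H(p, q) \;\ge\; H(p)
\qquad\Longrightarrow\qquad
\mathrm{PPL}_p(q) \;=\; e^{\,H(p, q)} \;\ge\; e^{\,H(p)}
```

Since shuffling word order removes predictive structure, H(p) is higher for the shuffled data, and comparing raw perplexities across setups mixes up the model's learning ability with the data's inherent predictability.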
In sum, I'd say that Chomsky's view on this matter has not been disproven. Maybe, if we view LLMs as universal computers, it even seems natural to think that they can learn all kinds of patterns with about the same facility, be they "possible" or "impossible". Whether you want to take it one step further and also agree with him that LLMs therefore can't tell us much about human language is very much up to you.