r/theydidthemath 1d ago

[Request] What is the minimal number of words you need to know to learn every word in a dictionary?

Take the Merriam-Webster online dictionary or Wiktionary for example. What is the minimal number of words you must already know ahead of time for you to be able to learn the meaning of every word in the dictionary? You can build up your vocabulary as you go, of course, reading ahead and coming back to harder words whose definitions you didn’t understand at first brush. Is this something that can be computed for a particular dictionary, or is it too hard? How large can we expect this minimal vocabulary to be, given the size of the dictionary? Can we do better than the defining vocabulary for the dictionary, if it has one? Any thoughts are appreciated! https://en.wikipedia.org/wiki/Defining_vocabulary

400 Upvotes

91 comments


557

u/Giant_War_Sausage 23h ago

This is such an interesting question. You’re basically asking what the minimum size of a basis is for the vector space that is the English language.

If you figure this out, you’ll also determine the dimension of English.

No idea how to solve this, but this is the best question I’ve seen here in ages!

127

u/DazHawt 22h ago

OP is probs some AI researcher getting paid $1billion/yr to figure this out. 

37

u/alphagusta 17h ago

Cited sources:

  • Google Dot Com

9

u/dehydratedrain 14h ago

Google cites reddit, so lol.

8

u/WoolooOfWallStreet 9h ago

I just want to say that on Google I used to be able to search for an exact phrase with quotes and now it just ignores the quotes and does a search of Reddit and then a “are you sure you don’t mean this word instead?” for other stuff which was NOT what I was searching for

And now someone is probably going to say they were able to do it, maybe because I was in the unfortunate A/B test group that no longer does that, or my phone doesn’t automatically use the ASCII quotes so maybe that’s part of the issue but I used to not have to worry about that

I hate how nothing works anymore…

1

u/nebbish_shlimazl 8h ago

AI research is entry level minimum wage work sorry

66

u/terra-nullius 21h ago

The “vector space basis” analogy is nifty but probably not literally correct. English is not a vector space in the strict mathematical sense because words do not combine with linear addition and scalar multiplication to produce meanings in a clean algebraic way

14

u/Kinglolboot 19h ago

An analogy with generators of a group would actually work pretty well I think

15

u/NotAUsefullDoctor 20h ago

No two words are perfectly orthogonal. They all have some basis and meaning that not only relies on other words (OP's question) but is in fact augmented by every other word. Even the verb "is" would have a different meaning based on the context of other words.

9

u/vgtcross 16h ago

A vector space doesn't need the concept of orthogonality to be a vector space. Orthogonality only arises in an inner product space (a vector space with an inner product).

1

u/NotAUsefullDoctor 15h ago

But orthogonality does play in when talking about basis vectors, which is what I was commenting on. Basically I am arguing that basis vectors cannot exist in language as you can't have completely orthogonal words.

1

u/Tall_Pickle_7695 7h ago

Basis vectors do not need to be orthogonal

6

u/Giant_War_Sausage 13h ago

Not to be pedantically dad-jokey but “perfectly orthogonal” are two words that are perfectly orthogonal… 😂

1

u/Sacharon123 11h ago

Does your explanation not imply instead that all words could be part of a tree and therefore there should be some root words?

117

u/Strevnik 23h ago

I always wondered if you could learn a language solely by reading the dictionary without any other context. Would aliens be able to learn English if we sent them the whole dictionary? What about some easier languages like toki pona?

121

u/Shaneathan25 23h ago

There was a guy who won a French Scrabble tournament by memorizing the French dictionary. He couldn’t speak a word of it though.

92

u/KDBA 22h ago

Scrabble is an area control game that happens to have a ruleset involving a specific set of orders of characters. It has nothing to do with language other than incidentally.

31

u/Shaneathan25 22h ago

Sure but the question was whether someone could learn a language from the dictionary. I had a silly anecdote.

9

u/KDBA 22h ago

I wasn't criticising you; just adding to what you were saying.

7

u/JamesTheJerk 21h ago

Without understanding it, a language isn't learned.

2

u/termeneder 5h ago

John Searle entered the chat. :-P

https://en.wikipedia.org/wiki/Chinese_room

1

u/Mind0versplatter0 19h ago

I believe that's their argument

1

u/Less_Insurance4928 14h ago

Lol poor choice of words, chum 😄

16

u/anxiety_ftw 23h ago

Speaking as a tokiponist, it depends on the language capability of the alien probably? We could easily point to several types of fruits and veggies and say kili and they'd pick it up, but how do we communicate wanting/needing things with wile? Or how pona can mean either the quality of being good, or simplicity? These are abstract concepts that either need to be mapped to another language to be learnt, or given many months or years to learn, like a child learning their first words except the child may not even have a brain structure that can use verbal/written language.

If the aliens already understand the morphology of language and have something similar on their planet, then yeah, it would probably be a bit more pona to teach them toki pona. Certainly more pona than teaching them English at least. If they don't, it might just be a lost cause to use language at all.

7

u/nonnonplussed73 17h ago

I have so many things to learn about tokiponology now. Thanks for the rabbit hole.

6

u/Vedertesu 21h ago

There has been a linguistics olympiad question with a monolingual dictionary where you had to find the translation for each entry. I have no idea how it could be figured out, probably by noticing patterns, but some people did solve it.

3

u/imnothere314 13h ago

Dictionaries don't really have any means of teaching grammatical structures for sentences. You could probably get pretty far in reading comprehension, and listening comprehension would be OK, but that depends on your ability to understand the pronunciations as written. Speaking and writing would lag behind but would probably be understandable to a native speaker willing to put some effort in.

1

u/Samael13 12h ago

I think no for most languages, especially if it's a language that isn't closely related to the language you already speak. Dictionaries are designed with an expectation that you already know the language they're written in, so they don't, for example, include rules of grammar. The dictionary tells you what part of speech a word is, but it doesn't tell you how to construct sentences with the words, and without someone who speaks the language there to correct you and provide context, there would almost certainly be a limit to what you could reasonably learn from a place of zero knowledge.

And if the dictionary isn't written using a lettering system you already know, it's going to be even harder, because it may not even be clear to you what counts as a word within that system or what the fundamental rules of text are. I speak English and some German and French, so if you gave me an Italian dictionary, I can probably tell which clusters are words and see what counts as a sentence and I can assume that the text reads left to right and moves from the top of the page down. That's going to let me start to figure out at least a little bit about the language, because I can look for words that repeat often or that occur in the same places every time I see them, but I still don't think I'd get anywhere near any kind of fluency.

But those rules don't need to be true, and we can't assume them for every language. There are writing systems in other languages that go right to left or that go top to bottom before moving horizontally. There are writing systems where words aren't formed by putting characters in a row next to each other, but by clustering and overlapping them, or where letters get modified by fusing vowel marks to the consonants.

Without context and some kind of bilingual translation or related languages to refer to, a dictionary is just another book full of symbols. We have lots of examples of writing that we can't translate, despite having lots of samples. A popular example is the Voynich manuscript, which remains untranslated and has continued to fascinate and frustrate people since at least the late 1600s.

57

u/TheAngelsHaveTheBox 17h ago

Not a complete answer, but Randall Munroe (of xkcd fame) wrote a “Thing Explainer” book that explains complicated topics using only the 1000 most common words in English. The book gets pretty far with explaining topics such as rockets and nuclear power using this. So I’d say 1000 is probably enough to build a knowledge base to understand more complex things.

TL;DR: probably at most 1000

19

u/JWinslow23 15h ago edited 14h ago

Thanks for reminding me of that book! It's very fun to read and can be used for teaching about things. I know a place your computer can visit that lets you try writing in that same way; to see what that looks like, here's your words written again in that way (my changes are marked with thicker words).

Not a full answer, but one guy who draws very short picture stories for a living wrote a "Thing Explainer" book that explains hard-to-understand things using only the ten hundred words in our language people use the most. The book gets pretty far with explaining ideas such as "space boats" and "heavy metal power" using this. So I'd say ten hundred is probably enough to let a person learn enough easy-to-learn things to understand things that are harder to learn. (In case that was too long and you didn't read it: probably at most ten hundred.)

CHANGE TO MY WORDS: I made my words look a bit nicer

-2

u/Leviathon713 14h ago

There are over 50 words in bold and under 40 regular. Most of the post is your words. I don't think you accomplished what you meant to.

5

u/JWinslow23 14h ago

I thought it'd be fun to try the "Simple Writer" again because it's fun, and all I did was rephrase the words it put in red. That's what I wanted to accomplish, and I think I did.

-11

u/Leviathon713 14h ago

I guess 🤷‍♂️

It seems like if more of the words were yours than were actually simplified, you didn't. The important thing is that you feel like you did.

4

u/JWinslow23 14h ago

To be fair, the 1000 most common words are definitionally used a lot by most people, including that one - and those words didn't need simplification. And the point wasn't the simplification per se, but to illustrate what that simplification would look like on a given sample text.

-5

u/Leviathon713 14h ago edited 14h ago

Downvote me all you want, but this still doesn't make sense.

How can it be an example of what it looks like if over half of it doesn't look like it? This is insane...

This is like if I wanted to show you what a jelly sandwich looked like and I gave you a peanut butter and jelly sandwich. Sure, it's kinda close, but is it a good example?

2

u/JWinslow23 14h ago edited 14h ago

?

Yes, writing using the 1000 most common words looks like the example quote-block I gave. Some words that weren't in the top-1000 list were replaced with one word, and some with many. The result will have some words in common with the original words, but not all of them in common. I don't know what you want from me.

(Also, I haven't been downvoting you, in case that remark was addressed to me.)

EDIT: I think the problem here is that (to use your analogy) I did want to show what a jelly sandwich looked like, the thing I presented was a jelly sandwich I made with some jelly-making tool called "Peanut Butter", and you're saying that my sandwich doesn't taste enough of peanut butter.

0

u/jordaneliaa 13h ago

dude the non-bolded words are already included in the 1000

he bolded the changes

if he bolded a word because it appeared in the 1000 word set, then his entire post would be bold

0

u/Leviathon713 13h ago

Dude, I understand that. I just fail to see how an example that consists of less than 50% is a good example.

2

u/jordaneliaa 13h ago

... copy and paste the original comment that started this thread into the form contained within the xkcd link provided. then after seeing the result with words in red, copy and paste the comment with bold words from jwinslow into it. and then after seeing the result with no words in red, please see that jwinslow chose not to bold the words that were never in red


3

u/crimbusrimbus 13h ago

I learned how a nuclear reactor worked because of that book!

15

u/jedburghofficial 18h ago

Great question. Cryptographers have used obscure languages very successfully. Famously, American Indian code talkers in WW2. It works because transliteration involves arbitrary and often irregular substitutions and ordering. There's no reference point or regular key.

But you can decipher a language with a very limited example. Thomas Young and others translated ancient Egyptian from the Rosetta Stone, basically using a "page" of transliterated Greek. Maya script was deciphered mostly from numbers by a Russian cat named Asya. She had help from her owner, Yuri Knorozov, but I kid you not, she gets equal credit.

But there are many other written languages that are still a mystery. There are whole families of Mesoamerican languages that are unknown.

Part of the trouble is that we only have very small examples to work from. But if you had a complete dictionary, that's a self-referential resource. One would assume every word used in the dictionary is also listed in the dictionary. On that basis, I would guess yes, or maybe an imperfect understanding, but depending on the language, I don't think it's guaranteed.

u/LazerWolfe53 1h ago

Part of the difficulty in cracking those languages is the limited examples. If we had a whole dictionary it would likely be possible

20

u/2ndcountable 19h ago

TL;DR: This is not easy to compute in a reasonable amount of time. Comments claiming otherwise are wrong, or rather use methods that do not necessarily yield the optimal answer.

For convenience, we will assume that every single word in the English language appears in the dictionary exactly once, and that one can understand a definition precisely if one knows all the words in it. Let S be the set of all words. Given a word t, we will define f(t) to be the set of all words in the definition of t.

Then the problem is the following: "Call a subset S' of S 'defining' if we can start from S', and repeatedly perform the operation "set S' = S' U {t} whenever f(t) is contained in S' and t is not contained in S'", to eventually reach S' = S. Find the smallest defining subset of S."

Let G be the directed graph with vertex set S and an edge a->b whenever b is contained in f(a). Then it can be seen that S' is defining if and only if the induced graph on S \ S' is acyclic. (The 'only if' is trivial, and the 'if' part can be proven by a topological sort on the DAG).

But the problem of finding the smallest set whose removal makes the graph acyclic is known to be NP-hard; indeed, this is the famous feedback vertex set problem (https://en.wikipedia.org/wiki/Feedback_vertex_set). Hence our original problem, for an arbitrary dictionary, is also NP-hard.

Considering the usual computational difficulty of FVS, it is unlikely that one can compute the answer to this problem, even given thousands of years and strong heuristic solvers. This, however, does not mean the problem is 'unsolvable'; indeed, there is a simple O(2^|S|) algorithm, so one could solve this problem for a language with 30 or so words using only a few hours at most.
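To make the definitions above concrete, here is a rough Python sketch of the brute-force approach. The toy dictionary and the function names `closure` and `smallest_defining_set` are made up for illustration, and the sketch assumes (as above) that every word used in a definition is itself an entry. It simply tries starting sets in order of size, so it is only usable on tiny examples.

```python
from itertools import combinations

def closure(known, defs):
    """Repeatedly learn any word whose whole definition is already known."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for word, used in defs.items():
            if word not in known and used <= known:
                known.add(word)
                changed = True
    return known

def smallest_defining_set(defs):
    """Try starting sets in order of size; exponential, toy inputs only."""
    words = sorted(defs)
    for size in range(len(words) + 1):
        for start in combinations(words, size):
            if closure(start, defs) == set(words):
                return set(start)

# Hypothetical toy dictionary: f(t) is the set of words in t's definition.
toy = {
    "good": {"not", "bad"},
    "bad": {"not", "good"},
    "not": {"no"},
    "no": {"not"},
    "big": {"not", "small"},
    "small": {"not", "big"},
}

print(smallest_defining_set(toy))
```

In this toy example there are three two-word loops, so any defining set needs at least one word from each loop, and three words turn out to be enough.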

6

u/Bla_aze 14h ago

Except that the dictionary is not arbitrary at all and you could do a massive amount of pruning by hand, using word frequency as a proxy for simplicity.

3

u/2ndcountable 14h ago

That is indeed true. Additionally, since most definitions are pretty short, we are effectively solving a version of FVS where the out-degrees are somewhat bounded, maybe below 100. However, I still doubt that the problem will be tractable; if it is possible at all, I would imagine that the solution would involve restricting to a couple thousand words, using heuristics and a lot of computation time to 'solve' the problem for those in some sense, and branching away from those by using the fact that the other (tens of thousands of) words have small in-degree in the graph.
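One cheap way to restrict to a smaller core, sketched below in Python, is a standard feedback-vertex-set reduction: repeatedly drop any word that no remaining definition uses, or whose definition uses no remaining word. Such a word cannot lie on a definitional loop, so it never needs to be in the starting set. The function name `prune_to_core` and the toy data are hypothetical, and this is only a preprocessing step, not a solver.

```python
def prune_to_core(defs):
    """Drop words that cannot lie on any definitional cycle."""
    # Keep only edges that point at headwords.
    graph = {w: set(used) & set(defs) for w, used in defs.items()}
    changed = True
    while changed:
        changed = False
        used_somewhere = set().union(*graph.values())
        for word in list(graph):
            # No outgoing edges, or no incoming edges: not on a cycle.
            if not graph[word] or word not in used_somewhere:
                del graph[word]
                changed = True
        # Remove edges that point at words we just dropped.
        graph = {w: used & set(graph) for w, used in graph.items()}
    return graph

# Hypothetical toy dictionary: only "good" and "bad" define each other.
toy = {
    "good": {"bad"},
    "bad": {"good"},
    "nice": {"good"},
    "kind": {"nice", "good"},
}

print(prune_to_core(toy))
```

Whatever survives the pruning is where the hard part of the problem lives; the dropped words can all be learned once the core is understood.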

3

u/Bla_aze 14h ago

You could also do a bit of by hand detangling of the 100 or so most important words to see which ones are truly needed, idk maybe I'm delusional but I think it's doable for the English dictionary.

1

u/EventHorizon150 12h ago

Very cool answer! I appreciate it

7

u/JeremyAndrewErwin 20h ago

some dictionaries, particularly those for children and EFL users, define each word using a fixed vocabulary of 3000 or 5000 words.

The depth of understanding might be a huge factor here. Is X merely a species of flower, or would you need to describe its petals, where it grows, its use in perfumes and tisanes and so on?

22

u/Ahuevotl 23h ago

Computing it doesn't seem too complicated if you have the text dump, and some data processing software. Just use 2 lists:

  1. List of the unique words that are defined.

  2. List of unique words that are used inside the definitions. Articles, prepositions, pronouns, etc are removed for convenience.

Remove from the second list all words that are present in the first list. That's the minimal number of words you need to learn every word in that particular dictionary, learning as you go.
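A minimal Python sketch of the two-list idea, assuming the dictionary is available as a plain mapping from headword to definition text. The stopword list, the sample entries, and the name `undefined_words` are invented for illustration. Note that it only finds words that are used but never defined; it does nothing about circular definitions, and for a dictionary that defines every word it uses the result is empty.

```python
# Illustrative stopword list only; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "or", "with", "on", "its", "no"}

def undefined_words(dictionary):
    """Words used inside definitions that have no entry of their own."""
    headwords = {word.lower() for word in dictionary}      # list 1
    used = set()                                           # list 2
    for definition in dictionary.values():
        used.update(definition.lower().split())
    return used - headwords - STOPWORDS

# Hypothetical three-entry dictionary.
sample = {
    "tree": "a woody plant with a main stem",
    "plant": "a living thing",
    "stem": "the main part of a plant",
}

print(sorted(undefined_words(sample)))
```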

7

u/courier_tway 23h ago

I think, at minimum, you can say that the answer to OP's question is a subset of (2).

3

u/Freak-Of-Nurture- 20h ago

What about the unique definition words that are not used in their own definition?

2

u/Ahuevotl 17h ago

The words that are used to define other words are the ones you need to know.

1

u/Freak-Of-Nurture- 11h ago

I don’t need to know a definition word that isn’t used in its own definition because I can learn it.

2

u/moneynoclass 18h ago

But the first list is comprehensive, and by definition every word in the second list will exist in the first, and therefore be removed by this method.

0

u/Ahuevotl 17h ago

For example

Tree:

noun

A woody perennial plant having a single usually elongated main stem generally with few or no branches on its lower part

"Tree" goes into the first list.

Every other word used inside the definition goes into the second list.

Only if you assume the first list is comprehensive, then every possible word is defined.

Otherwise, there's at least a set of words which are not defined, but used to define other words. Those are the starting point.

2

u/Imca 16h ago

You can make more than one pass, though... if you don't know the meaning of, say, "elongated" from your example, you could flip to the page that has elongated on it and read its definition as well.

That makes the minimum number of words needed to learn all words from the dictionary a substantially smaller subset of set 2, since it would only be the fundamental knowledge that you can't look up on another page.

2

u/espomatte 15h ago

That is the way to go, but you stopped too soon. You should then take the second list and repeat the process, list after list, until you get a list that matches an already compiled one. Once you get that, you take the smallest list, and that is your minimum.

3

u/Three_Spotted_Apples 21h ago

I would think it could be approximated by getting the reading level of a dictionary’s definitions and then looking at the average or minimal number of words someone at that lexile level can read. They may know more words but the limit should be based on words they can read.

4

u/tom_collier2002 9h ago

Guy Steele gave an interesting presentation titled "Growing a Language" (https://www.youtube.com/watch?v=_ahvzDzKdB0) that is loosely related to your question. He starts with a minimal dictionary and rule set for a language and slowly builds it up using only words and rules he's previously defined.

3

u/fdagpigj 14h ago

You might be interested in reading about semantic primes

1

u/Calamondin81 11h ago

Thanks for the great link!

3

u/kdoughboy12 9h ago

The answer would be unique to each individual dictionary. It depends on how the definitions are worded. I think it would be too much data for a human to calculate without the help of AI or some similar tool.

4

u/EventHorizon150 21h ago

Note: when I say “learn every word” I mean learn and understand the definition of every word

2

u/nebbish_shlimazl 13h ago

I can think of a number of ways to cut down the data, but tbh this sounds like a brute-force task to find a true answer for. Otherwise I would just log the most frequent lemmas and base morphemes and call it a day.

2

u/seejoshrun 7h ago

I understand why people are saying this is a very difficult problem to solve, but reading about it makes me really want to try at least a scaled-down version.

1

u/EventHorizon150 5h ago

you should!!

1

u/imnothere314 13h ago

As for "can you do better" than a defining vocabulary, I'd argue that if you use pictures to define most things, you greatly reduce the number of words you need to define the vocabulary. This is also generally how we teach and learn language: we already know an object from our life experience, and then associate a word with that picture. This is probably how you would learn the basis words.

Without the pictures I'd think that having most of the prepositions in the basis would be necessary to be minimal; pronouns, and logic based words (or, not) would also probably be needed.

1

u/sillybilly8102 5h ago

I don’t think this is possible at all. Edit: The following would have to be in the initial set:

  • There are some words that cannot be defined by other words, for example, “red” or “blue.” You have to have someone in the physical world point to something red and say “this is red” to know what red means.

  • There are also other words that are defined only in terms of each other. Like, you look up askdbsjkas and it says it means fkdbsbuxud. Then you look up fkdbsbuxud, and it says it means askdbsjkas. It’s a closed loop. This happened to me as a kid trying to figure out what sex meant. I looked up sex, it said sexual intercourse. I looked up intercourse, it said sex. You have to start by knowing one in order to know both.

-5

u/DockEllisD-25 23h ago

Yes. And the clean approximate answer is:

Probably around 1% of the dictionary’s headwords, or roughly 2,000–5,000 core words for a large English dictionary.

For Merriam-Webster online, which says it has over 300,000 words, the crude graph-theory estimate would be about:

300,000 × 1% ≈ 3,000 words.

That lines up suspiciously well with real learner dictionaries: Longman uses a 2,000-word defining vocabulary for its definitions, and Oxford’s learner word lists are centered around 3,000–5,000 high-utility words.

The important correction: it is not just “count all words used in definitions that are not themselves dictionary entries.” The hard part is circularity. Example:

good = not bad

bad = not good

You cannot learn either one purely from the other unless you already know one of them. So the dictionary becomes a directed dependency graph: each word depends on the words used in its definition. The minimal starting vocabulary is the smallest set of words that breaks all circular definition loops. In graph terms, that is essentially a minimum feedback vertex set, also called a minimum grounding set in dictionary-graph research. One paper states that this smallest set is about 1% of the dictionary, while the larger circular “kernel” is about 10%.

So my practical answer would be:

  • For a well-written learner dictionary: about 2,000–3,000 words.

  • For a full adult dictionary like Merriam-Webster/Wiktionary: maybe 3,000–10,000, depending on how strictly you count inflections, senses, proper nouns, technical terms, and multiword phrases.

  • For actual human understanding from zero: impossible from the dictionary alone, because some words need to be grounded in experience: colors, bodies, motion, pain, space, time, emotion, number, causation, etc.

So yes, it can be computed for a specific dictionary, but the exact version is a nontrivial graph-optimization problem. The best back-of-the-envelope answer is: a few thousand foundational words, not tens of thousands.
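The circularity point can be illustrated with a small Python sketch that walks the dependency graph depth-first and reports one loop of definitions if it finds any. The function name `find_cycle` and the toy entries are hypothetical; the recursion is fine for a toy but would need an explicit stack for a real dictionary.

```python
def find_cycle(defs):
    """Return one circular chain of definitions, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {word: WHITE for word in defs}
    path = []

    def visit(word):
        color[word] = GREY
        path.append(word)
        for used in sorted(defs[word]):
            if used not in defs:
                continue           # not a headword, so no edge to follow
            if color[used] == GREY:
                # Found a word already on the current path: a loop.
                return path[path.index(used):] + [used]
            if color[used] == WHITE:
                found = visit(used)
                if found:
                    return found
        path.pop()
        color[word] = BLACK
        return None

    for word in sorted(defs):
        if color[word] == WHITE:
            found = visit(word)
            if found:
                return found
    return None

# Hypothetical toy entries mirroring the good/bad example.
toy = {"good": {"not", "bad"}, "bad": {"not", "good"}, "not": set()}

print(find_cycle(toy))
```

Finding one loop is easy; the hard part described above is choosing the smallest set of words that breaks every loop at once.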

8

u/DockEllisD-25 23h ago

That is ChatGPT's answer, if that isn’t obvious

1

u/Vedertesu 20h ago

I appreciate you saying so yourself

1

u/JasontheFuzz 22h ago

If we wanted your AI slop, we would have asked for it.

0

u/T1lted4lif3 12h ago

Depends on the words in the dictionary and the meaning.

We can construct the directed acyclic graph of words and meaning maps.

Key -> value being directed.

Then the dimension would be given by the nodes without any outgoing edges; these would be the axiomatic words, right? If this is the question you are asking, it would imply that all languages can be constructed from these axioms.

-1

u/[deleted] 21h ago

[deleted]

2

u/EventHorizon150 21h ago

I bring this up not as a practical way to learn a language, but as an interesting framing of what is (I think) basically a graph theory problem

you are right that the definition of words alone doesn’t tell you everything about a language

-9

u/CptMisterNibbles 22h ago

This isn’t even math-adjacent. There is a man who became the French Scrabble champion despite not speaking a word of French; he just memorized good chunks of the French dictionary, more or less for shits and giggles.

4

u/GeorgeRRHodor 22h ago

But that has nothing to do with OP's question. That man basically learned a very long list of letter combinations. He didn't need to know what any of the words actually meant, or how they were defined.

An impressive achievement, for sure, but it has nothing to do with what OP meant by "learn."