How it learned everything it knows

Before an AI can answer your questions, someone has to teach it. Not the way you would teach a child. No classroom, no textbooks, no patient explanations. AI learns by exposure, at a scale that is almost impossible to imagine.

The AI behind ChatGPT was trained on roughly 300 billion words. Books, articles, websites, Wikipedia, code, forums, recipes, legal documents, song lyrics, scientific papers, news archives. If you read nonstop for your entire life, eight hours a day every day, you would cover maybe a billion words total. This AI absorbed 300 times that, in a matter of months, on thousands of computers running in parallel.

Think about that scale for a moment. The Library of Congress contains about 170 million items. The average book is around 80,000 words. If every single item in that library were a book, the entire library would represent about 13.6 trillion words of text. By the same yardstick, 300 billion words is roughly 3.75 million books, and newer AI systems are reported to train on far more, into the trillions of words. It is a genuinely staggering amount of human writing, compressed into a piece of software.
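The arithmetic behind these comparisons can be checked in a few lines. The sketch below is only an illustration in Python, using the rough figures quoted above; the constant and function names are made up for this example, and every input is an estimate rather than a measured value.

```python
# Rough figures quoted in the text; all of them are estimates.
TRAINING_WORDS = 300_000_000_000        # words the AI was trained on
LIFETIME_READING_WORDS = 1_000_000_000  # words one person might read in a lifetime
LIBRARY_ITEMS = 170_000_000             # items in the Library of Congress
WORDS_PER_BOOK = 80_000                 # length of an average book


def lifetimes_of_reading(words=TRAINING_WORDS, lifetime=LIFETIME_READING_WORDS):
    """How many reading lifetimes the training text represents."""
    return words / lifetime


def library_words(items=LIBRARY_ITEMS, words_per_book=WORDS_PER_BOOK):
    """Words in the library if every item were an average-length book."""
    return items * words_per_book


def books_equivalent(words=TRAINING_WORDS, words_per_book=WORDS_PER_BOOK):
    """How many average-length books the training text would fill."""
    return words / words_per_book


if __name__ == "__main__":
    print(f"Reading lifetimes: {lifetimes_of_reading():,.0f}")
    print(f"Library in words:  {library_words():,}")
    print(f"Books equivalent:  {books_equivalent():,.0f}")
```

Changing any one input, such as the average book length, shifts the result in proportion, which is why these comparisons are best read as orders of magnitude rather than exact counts.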

The training process works like this: show the AI a sentence with the last word hidden, ask it to guess the missing word, check whether it was right, adjust the internal math slightly, and repeat. Billions of times. Across thousands of computers running for months. It is like a child learning to read by filling in blanks in a trillion different sentences, except that it happens at machine speed rather than over a childhood.
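The guess, check, adjust, repeat loop described above can be sketched as a toy program. This is a deliberately tiny illustration, not how any real system is built: it assumes the guess depends only on the single word before the blank, and it stores one adjustable score per word pair, whereas real systems use neural networks with billions of adjustable numbers and look at far more context. The names `train`, `guess`, and `softmax` are made up for this example.

```python
import math


def softmax(row):
    """Turn a row of raw scores into probabilities that sum to 1."""
    highest = max(row.values())
    exps = {word: math.exp(score - highest) for word, score in row.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def train(sentences, epochs=200, learning_rate=0.5):
    """Learn to guess each sentence's last word from the word before it."""
    vocab = sorted({word for s in sentences for word in s.split()})
    # The "internal math": one adjustable score per (previous word, guess) pair.
    scores = {prev: {word: 0.0 for word in vocab} for prev in vocab}
    for _ in range(epochs):                    # repeat
        for sentence in sentences:
            words = sentence.split()
            context, hidden = words[-2], words[-1]  # hide the last word
            probs = softmax(scores[context])        # the model's current guess
            for word in vocab:                      # check, then adjust slightly
                target = 1.0 if word == hidden else 0.0
                scores[context][word] += learning_rate * (target - probs[word])
    return scores


def guess(scores, text_without_last_word):
    """Return the most likely next word after the given text."""
    context = text_without_last_word.split()[-1]
    row = scores[context]
    return max(row, key=row.get)


if __name__ == "__main__":
    examples = [
        "the cat sat on the mat",
        "the dog sat on the mat",
        "she drank a cup of tea",
        "he poured a cup of tea",
    ]
    model = train(examples)
    print(guess(model, "please pour me a cup of"))
```

Each pass nudges the score of the correct word up and the scores of the wrong words down by a small amount, so the model's confidence in the right answer grows gradually rather than being written in directly.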

After enough repetition, something remarkable happens. The AI does not just learn which words follow which other words. It develops a deep statistical model of how language works: what topics connect to what, how formal writing differs from casual writing, what a persuasive argument looks like versus a weak one, what a question usually expects in return, how instructions differ from stories, how medical writing differs from sports coverage.

That is why it can write a poem, explain a tax form, or help you word a difficult email. It is not pulling from a database of pre-written poems. It is not copying text it memorized. It is generating new text based on an extraordinarily sophisticated understanding of what good writing looks like, because it absorbed so many examples of good writing that the patterns became embedded in its math.

Ask it to write a poem in the style of Robert Frost and it draws on everything it absorbed about Frost's particular rhythms and subject matter: the rural New England settings, the deceptively simple language, the undercurrent of something darker beneath the pastoral surface. Ask it to explain Medicare Part D and it draws on thousands of explanations of Medicare it encountered during training, synthesizing them into something coherent for you specifically.

Here is the important part: it learned all of this from text that humans wrote. Every capability it has comes from human knowledge. Every limitation reflects gaps or biases in what it was trained on. If most of the text it encountered about a topic had a particular slant, that shows up in its responses. If the training data underrepresented certain perspectives or communities, those gaps show up too. Every bias in its answers came from somewhere in those hundreds of billions of words, written by humans, carrying all the blind spots and assumptions we have built into our writing over centuries.

There is also a time boundary, and this one trips people up regularly. The training data has a cutoff date. After that date, the AI simply does not know what happened. If a major health guideline changed last year, the AI might still give you the old version, with complete confidence. If a business closed, the AI might still describe it as open. If a law changed, the AI might cite the old version.

The AI is not being deceptive. It genuinely does not know what it does not know. It is like talking to someone who spent years reading everything ever written, then went into a sealed room with no news access. Brilliant on everything up to that point. Completely unaware of anything that has happened since.

Knowing when the AI's training data ends is useful information. For most general knowledge topics, the cutoff does not matter much. For anything time-sensitive, including current medications, recent legal changes, current prices, or recent news events, always check a current source before acting on what the AI tells you.

The AI is essentially a mirror of human knowledge up to a certain date: impressive, vast, not always current, not always right, and shaped by the assumptions of the people who wrote the things it learned from. Understanding how it learned is what lets you use it well and spot its mistakes before they matter.

The practical consequences are worth spelling out clearly. For questions about stable, well-documented topics (history, cooking, grammar, how machines work, how to write a business letter, what a particular medical condition involves), the AI is likely to be accurate and useful. For questions about recent events, current prices, who won an election last month, or what the updated guidelines say on a particular health topic, the AI may be wrong or simply uninformed. That is not a defect in any one product. It is just the nature of a fixed training dataset meeting a world that keeps changing.

There is also a more subtle consequence. Because the AI learned from text written by humans over many decades, it absorbed the perspectives, assumptions, and blind spots of whoever wrote that text. Topics where the available writing was skewed toward one viewpoint will produce responses that reflect that skew. This does not mean the AI is deliberately biased in any political sense. It means that any large collection of human-written text carries the values and assumptions of the people who wrote it, and the AI learned from that collection.

None of this makes the training process flawed or the technology useless. It is just context for understanding what you are working with: a system that learned from human knowledge, inherits human limitations, and works best when you understand both what it knows well and where its knowledge runs out.

Next, we will look at the most persistent misconception about AI, the one that makes it seem more frightening and more magical than it actually is.
