How AI Works 🤖🔎

An entirely non-technical explanation of how LLMs actually work

Learning about AI

I’ve been surrounded by discussion about AI lately. I’m sure you have too. The endless discussions of its implications, the ethical questions it raises, the pros and cons. Yet little of the discussion amongst my non-technical friends ever touches on how any of this stuff actually works. That’s because, from the outside, the concepts seem daunting. The idea of grasping how large language models (LLMs) function seems insurmountable.

But it’s not. Anyone can understand it. And that’s because the underlying principle driving today’s surge in AI is fairly simple.

So bear with me for fewer than two thousand words, and I’ll try to explain—without a single technical word or mathematical equation—how LLMs actually work.

The Menu

Imagine this: You’re cooking dinner, but you need to come up with one more side dish to serve. The food you’re preparing is just shy of being enough, so you need one more component to add to the meal.

But that’s easier said than done. Whatever we pick needs to fit in with the meal. If the meal is savory, our side dish should be too. If it already has a salad, we shouldn’t make another. If the meal is starch-heavy, maybe we’d want to throw in a roasted vegetable.

Wouldn’t it be nice to have an app that just tells you what to make? And not randomly. You feed in what you’re already making, and it tells you the optimal side dish to add. This app should work for any meal, with any combination of dishes and flavors, regardless of whether it’s feeding four people or forty.

Here’s how we’re going to make this app. Two simple steps…

First, we’re going to have it think about each meal in a way a computer can understand. After all, computers don’t have taste buds. They need to be able to take a concept of which they have no intuitive understanding (food) and encode it as some kind of data capturing everything that might impact how well it fits with other food.

Second, we’re going to have it learn a way to take any set of existing dishes and spit out another. It’s not merely going to memorize what it’s seen before. Recall that this app needs to work for any combination of dishes, even ones it’s never seen paired together. So we’re not just going to program the system. We’re going to teach it.

Step One: Modeling Meals

So, step one. We need to teach the computer to think about meals as data. We’re not going to do this by telling it things about the meal (like what it tastes like or what it fits with). That’s the old type of machine learning. Too limiting; too error-prone. Instead, we’re going to just feed it a lot of data about what types of dishes people have paired together for meals in the past.

Let’s consider two types of dishes: say, a Caesar salad and a caprese salad. We, as humans, know that these two dishes are similar. They’re both Italian, they’re both salads, they both contain vegetables and cheeses… But for a machine to learn how similar these two dishes are, it need not know any of the above.

As we search through our mountain of data, whenever we see a Caesar salad, we’re likely to see it paired with other Italian dishes. And when we see it, we’re probably not going to see another salad in the meal. Interestingly, the same can be said of caprese salads. They won’t typically appear with other salads, but they will appear with Italian dishes.

Because these two dishes will often co-occur with the same types of other dishes, we can categorize them as being similar. They tend to be found in the same patterns of food. You might say “a dish is characterized by the company it keeps.”

And here’s the unintuitive part: notice that we didn’t look for any meals where Caesar and caprese salads occur together. They never need to occur together for us to deem the dishes similar. They simply need to be found amongst the same other dishes for us to determine that people generally find them interchangeable and therefore quite similar.

Here’s another way to think about what we just did. Imagine we wanted to graph all food on this chart:

And to start, we took all the possible dishes we found in our data, and plotted them randomly:

Here, we’re only showing four dishes for illustrative purposes. But imagine literally every possible dish.

Now as we look through our data, each time we find two foods that co-occur with the same other dishes, we can move them closer together. As we see different types of sushi that tend to be coupled with the same miso soup, we’re going to inch the sushis toward each other. As we see pizza and spaghetti both appear alongside garlic bread, we’ll let them come together too:

And after doing this many times (and I mean many times), something magical occurs. Dishes that are fully interchangeable will cluster very closely together. Dishes that are somewhat interchangeable (say, tacos and burritos) will sit near each other. And dishes that are rarely if ever interchangeable (say, burgers and sushi) will end up far apart.

Now, in practice, two dimensions aren’t enough. Every cuisine and dissimilar meal needs to be sufficiently spaced, which means the real version would have many more axes (hundreds, maybe thousands). That’s impossible to visualize, but the underlying concept is the same: we scatter all our foods and move them closer together as they co-occur with similar dishes.
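For the curious, here’s what that nudging might look like in a few lines of code. Everything here is invented for illustration (a handful of dishes, made-up meals, two dimensions instead of hundreds), and it’s a sketch of the idea rather than how real systems are actually trained:

```python
import random

# Toy sketch of the "nudging" idea: dishes seen in the same meal get
# pulled a small step toward each other. Dishes, meals, and starting
# coordinates are all invented for illustration.
random.seed(0)
dishes = ["pizza", "spaghetti", "garlic bread", "sushi"]
coords = {d: [random.random(), random.random()] for d in dishes}

meals = [
    ["pizza", "garlic bread"],
    ["spaghetti", "garlic bread"],
] * 100  # many times (and I mean many times)

step = 0.05
for meal in meals:
    for a in meal:
        for b in meal:
            if a == b:
                continue
            # move dish a a small step toward dish b
            coords[a] = [pa + step * (pb - pa)
                         for pa, pb in zip(coords[a], coords[b])]

def dist(a, b):
    """Straight-line distance between two dishes in our toy meal-space."""
    return sum((x - y) ** 2 for x, y in zip(coords[a], coords[b])) ** 0.5

# Pizza and spaghetti never appeared in the same meal, yet they end up
# close together, because both kept getting nudged toward garlic bread:
print(dist("pizza", "spaghetti") < dist("pizza", "sushi"))  # → True
```

Note that pizza and spaghetti never co-occur in the toy data; they cluster anyway, purely through their shared neighbor. That’s the “company it keeps” principle in action.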

As a shorthand, I’m going to refer to this larger many-axis graph as meal-space. Every possible food exists in meal-space, sitting at coordinates close to those foods with which it is interchangeable, and far from those that are very different.

Taking a step back, let’s appreciate how fascinating this is. We were just able to come up with a very accurate model of food types whereby similar ones are grouped together and different ones are far apart. And we did it without factoring in anything about how the foods taste or what they’re made of.

Plus, because we trained this on so much data, we’re able to do something else. We’re able to do food arithmetic.

Food arithmetic? “Nir, you’re crazy!”

I assure you, I’m not. You’ll have to take my word for it, but it turns out that the placement of dishes in our meal-space isn’t random. Not only are similar dishes grouped together, but the relationships between foods make logical sense. Foods containing bread all sit on one plane together. Salty foods lie along a common line. Maple-flavored things share some kind of mathematical link.

And that allows us to actually do things like this: If I were to take the coordinates for a burrito and subtract the coordinates for a tortilla, I’d end up near the point of a burrito bowl. If I were to take the coordinates for chicken noodle soup, subtract the coordinates for noodles, and add the coordinates for rice, I’d end up near the point for chicken rice soup.

Food arithmetic! 🤯 

The important takeaway: The placement of meals in meal-space isn’t random anymore. In fact, there are underlying, hidden mathematical patterns that mean every food is placed in a logical way relative to every other food.
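If you’d like to see that arithmetic spelled out, here’s a tiny sketch. The coordinates are completely made up (real meal-space has hundreds of axes), but the mechanics of subtracting coordinates, one axis at a time, are exactly this simple:

```python
# Made-up meal-space coordinates, purely for illustration.
# A real meal-space would have hundreds or thousands of axes.
burrito      = [0.9, 0.8, 0.7]
tortilla     = [0.1, 0.6, 0.1]
burrito_bowl = [0.8, 0.2, 0.6]  # hypothetical coordinates

# "Burrito minus tortilla": subtract each axis in turn
# (rounded to keep the floating-point noise out of the way).
result = [round(b - t, 2) for b, t in zip(burrito, tortilla)]

print(result)                   # → [0.8, 0.2, 0.6]
print(result == burrito_bowl)   # → True: we landed on the burrito bowl
```

The chicken-noodle-soup example works the same way: subtract the noodle coordinates, add the rice coordinates, and you land near chicken rice soup.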

Step Two: Finding Patterns

Okay, great, so we’ve created meal-space and given every type of food some kind of coordinate that makes sense relative to every other food. Now what?

Well, let’s train our model again. Only this time, we’ll feed it whole meals—we’re talking every meal we’ve ever seen—and we’ll ask it to find patterns. Specifically, we want to train our program to answer this question: If a meal contains A, B, and C, what type of dish is most likely to fill slot D?

And to do that, all we have to do is ask, for every meal we train on: What does that look like in meal-space? For instance, say we see many meals that share dishes in these four areas of the graph:

We can now generalize and think solely in terms of coordinates in meal-space, ignoring which specific foods taught us the pattern in the first place. We can conclude that if a meal already contains dishes in these three regions, the best fourth component will be found in that last region:

Recall that “a dish is characterized by the company it keeps.” And because our model was trained to think about types of foods and the relationships amongst them, rather than what specific dishes contain and how they taste, it can take any scenario and any combination of flavors and figure out the optimal dish to add to the meal. Given a few regions of food, it just needs to find the most likely region for the next dish…

…which takes us back to our original goal, now completed. We wanted to build an app that could reliably tell us which dishes to pair with a selection of other dishes. And we did just that.

Words Instead of Recipes

So what does all of this have to do with large language models?

Simply replace the concept of meals with sentences. And replace the concept of dishes in those meals with words. That simple substitution, with the same framing and approach, essentially gets you to the generative text-based AI tools we’re all familiar with today.

Step one: Train a model to understand the relationships between words based on how often they appear in similar contexts. “A word is characterized by the company it keeps.” Feed it a ton of human-written data (and when I say a ton, I essentially mean the entire internet), and let it nudge word coordinates around appropriately.

The output isn’t called meal-space anymore. It’s called vector-space. But the principles are the same. The system has no awareness of what any word means (just like it had no awareness of how a dish tastes). It only understands how that word is related to every other word in vector-space.

Step two: Find patterns. If a sentence contains words A, B, and C, what’s the most likely next word? If it contains X and Y, what region of vector-space should it look in for what comes next?

In the case of LLMs, all they’re really doing under the hood is what’s referred to as “next word prediction” (just as our meal app performed “next dish prediction”). For instance, let’s say you prompted an LLM with: “Tell me you love me.” It would search through all its pattern finding to answer one question: What word is most likely to follow that sequence of words? Or phrased differently: Given the vector-space coordinates of the words in that sentence, what patterns have I seen in other sentences to determine where I can find the next word?

The answer the LLM will find is “I”. And having determined that, it will tack the “I” to the end of your original prompt and feed that whole thing back into itself. Now, what word is most likely to come next after “Tell me you love me. I”? Why, “love”, of course! Tack it on, take the whole thing, and feed it back in. What’s likely to come next: “Tell me you love me. I love”?…

You get the idea.
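That feedback loop is simple enough to sketch. The lookup table below is just a stand-in for a real model’s pattern finding (an actual LLM computes its prediction from vector-space patterns rather than looking anything up), but the tack-it-on-and-feed-it-back loop is genuinely the shape of how generation works:

```python
# A toy version of the next-word-prediction loop. The "model" here is
# a hand-made lookup table standing in for a real LLM's predictions;
# only the loop itself reflects how generation actually proceeds.
toy_model = {
    "Tell me you love me.": "I",
    "Tell me you love me. I": "love",
    "Tell me you love me. I love": "you",
}

def predict_next_word(text):
    # A real model would compute this from learned patterns.
    return toy_model.get(text)

prompt = "Tell me you love me."
while True:
    next_word = predict_next_word(prompt)
    if next_word is None:  # a real model also learns when to stop
        break
    prompt = prompt + " " + next_word  # tack it on, feed it back in

print(prompt)  # → Tell me you love me. I love you
```

Each pass through the loop answers exactly one question—what comes next?—and the growing prompt is the only memory the loop needs.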

Wrapping Up

Of course, there’s a bit more nuance. There’s some fancy math and complex computing. But the fundamentals truly are no different than those in the meal-planning example.

To me, that sheds light on why this AI phenomenon we’re living through is so fascinating. Considering how transformative this technology is, it’s not actually that complicated. A few simple mathematical concepts, a whole lot of training data, a sprinkle of salt and pepper, and you’ve essentially built yourself a thinking machine.

Learning More

If you’d like to learn more about AI in a way where complicated topics are regularly distilled down to easy digests, I recommend subscribing to Answers on Artificial Intelligence. Each email makes you smarter about AI, no former knowledge required.

And do you know someone else who might benefit from understanding today’s AI craze? Please share this article with them. Forwarding and sharing is the number one way Z-Axis is able to grow.
