LLMs Are the CPUs of the AI Era

After building several apps powered by LLMs, I developed a mental model that helps me think more clearly about "AI programming."
LLMs Are Like CPUs
A CPU runs a program, ultimately expressed as machine instructions, processes an input, and delivers an output. Often, this means modifying the memory state. Similarly, an LLM (Large Language Model) takes instructions in the form of text, processes user input, and produces text output. It can even transform or enhance the input itself.
But, just like CPUs, LLMs have limitations. CPUs rely on external memory and storage for persistence, as their internal memory is limited. Likewise, LLMs depend on external data to enrich their context, as they can't know everything from their training data alone. We need to feed them information from the outside world to accomplish more complex tasks.
CPUs are also supported by specialized components—memory controllers, security enclaves, display processors, and more. LLMs are no different; they need specialized support too. Smaller LLMs can assist larger ones, managing tasks like filtering outputs or interpreting commands. Gen AI programs often need multiple LLMs and other components to handle data transformations, manage communication, safeguard content, and more.
This analogy of LLMs as CPUs is why we naturally turn to programmers for building Gen AI solutions. But maybe it’s time to rethink this approach.
Differences Between LLMs and CPUs
CPUs and LLMs have several key differences:
Speed: CPUs are lightning fast, measured in trillions of floating-point operations per second (TFLOPS). LLMs, on the other hand, generate a few dozen tokens per second (TPS), so they're much slower. We're talking about a difference of ten orders of magnitude.
Determinism: CPUs are deterministic—the same input yields the same output. LLMs are probabilistic. Even with identical inputs, their output can vary. You can lower the temperature parameter to make LLMs more deterministic, but that often makes them less creative and less capable at problem solving.
Infinite Configurations: In traditional computing, there are limited ways to reach a solution, and programmers can agree on the best implementation in terms of computational efficiency. In LLM programming, there are infinite ways to get to an answer. Everything can be adjusted: prompts, models, temperature, reference documents, or even supporting LLMs. This variety makes the process inherently uncertain, and intimidating: a configuration that works perfectly in one scenario may fail in another.
Error and Reliability: In traditional systems, unexpected errors are rare, sometimes caused by phenomena like cosmic rays flipping a memory bit—an occurrence that error correction can handle. In contrast, failure in LLM-based systems is inevitable. Even the best LLM agents might only work 95% of the time, leaving a significant chance of error.
Cost: Running LLMs is expensive. For example, OpenAI's API charges $2.50 per million input tokens and $10 per million output tokens. Processing requests with moderate input and output sizes can cost $5 per 1,000 requests. In comparison, edge requests from a cloud provider like Vercel might cost $2 per million. The difference in cost is stark.
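To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python using the prices quoted above; the request size of roughly 1,000 input tokens and 250 output tokens is an illustrative assumption, not a measurement.

```python
# Back-of-the-envelope cost estimate at the API prices quoted above.
INPUT_PRICE_PER_M = 2.50    # dollars per million input tokens
OUTPUT_PRICE_PER_M = 10.00  # dollars per million output tokens

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A "moderate" request: ~1,000 input tokens and ~250 output tokens (illustrative sizes).
per_request = cost_per_request(1_000, 250)
print(f"${per_request:.4f} per request, ${per_request * 1_000:.2f} per 1,000 requests")
# -> $0.0050 per request, $5.00 per 1,000 requests
```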
Given these differences, traditional programming approaches struggle with LLM-based systems.
Programming with LLMs
So, how do we build efficient Gen AI programs?
First, not all tasks require the most powerful LLM. Break the problem into smaller sub-tasks: use traditional programming techniques for data-heavy processing and user interface rendering, leverage small models for simpler AI tasks like sentiment analysis and spelling correction, and reserve the heavy lifting for larger models only when necessary.
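As a minimal sketch of that split, the snippet below routes a cheap classification step to a small model and only calls a larger one when it is really needed. It uses the OpenAI Python SDK; the model names and the routing rule are illustrative assumptions, not recommendations.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SMALL_MODEL = "gpt-4o-mini"  # illustrative: a cheap model for simple tasks
LARGE_MODEL = "gpt-4o"       # illustrative: a stronger model for the heavy lifting

def classify_sentiment(text: str) -> str:
    """Simple AI task: a small model at temperature 0 is cheap and predictable enough."""
    resp = client.chat.completions.create(
        model=SMALL_MODEL,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"Label the sentiment of this text as positive, negative or neutral:\n{text}",
        }],
    )
    return resp.choices[0].message.content.strip().lower()

def draft_reply(ticket: str) -> str:
    """Heavy lifting: only open-ended generation goes to the larger model."""
    resp = client.chat.completions.create(
        model=LARGE_MODEL,
        messages=[{"role": "user", "content": f"Draft a helpful reply to this support ticket:\n{ticket}"}],
    )
    return resp.choices[0].message.content

def handle_ticket(ticket: str) -> str:
    # Traditional code decides the routing; each LLM only does what it is needed for.
    if classify_sentiment(ticket) == "negative":
        return draft_reply(ticket)
    return "Thanks for the feedback!"  # no large-model call needed
```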
Orchestration becomes crucial because LLMs are slow. Streaming, parallelism, and speculative execution can help speed up the process, but they can't hide all the latency. For instance, Midjourney shows a half-finished image as it generates, yet you still need to wait for the full result to evaluate it. The same goes for using LLMs to determine function parameters: you can't execute the function until all the parameters are ready.
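Here is a minimal sketch of that kind of orchestration, assuming the async variant of the OpenAI Python SDK: two independent sub-tasks run in parallel, and their results are only combined once both have fully arrived.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def summarize_review(review: str) -> dict:
    # The two sub-tasks are independent, so run them concurrently
    # instead of paying the LLM's latency twice in sequence.
    summary_task = ask(f"Summarize this review in one sentence:\n{review}")
    tags_task = ask(f"List three topic tags for this review:\n{review}")
    summary, tags = await asyncio.gather(summary_task, tags_task)
    # Like function parameters, the combined result is only usable
    # once every piece has fully arrived.
    return {"summary": summary, "tags": tags}

if __name__ == "__main__":
    print(asyncio.run(summarize_review("The battery lasts two days but the screen scratches easily.")))
```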
Error handling is another essential component. You need mechanisms to detect when an LLM has produced a wrong or harmful output and decide whether to retry, adjust, or abandon the task. Many commercial LLMs also ship built-in safety layers (a.k.a. safeguarding) that cut off offensive content mid-stream.
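A minimal sketch of such a detect-and-retry loop, in which traditional code validates the LLM's output before trusting it; the JSON schema, model name, and retry limit are illustrative assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_order(ticket: str, max_attempts: int = 3) -> dict | None:
    """Ask for structured output and validate it before trusting it."""
    prompt = ("Extract the order id and the problem from this support ticket. "
              'Answer with JSON of the form {"order_id": "...", "problem": "..."}.\n' + ticket)
    for attempt in range(max_attempts):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = resp.choices[0].message.content
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue                   # wrong format: retry
        if isinstance(data, dict) and {"order_id", "problem"} <= data.keys():
            return data                # output passed validation
    return None                        # abandon the task after max_attempts
```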
Evaluating the effectiveness of an LLM setup is like running continuous integration tests. You need robust evaluation to ensure reliability, given the numerous ways to configure and prompt an LLM. The eval systems may even involve other LLMs, which require their own tuning. Just like with automated tests, you need to rerun the evaluation benchmarks after every configuration change to catch issues before they reach customers.
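As a sketch, an eval harness can be as simple as a set of golden cases and a pass threshold, rerun after every change; the cases below are made up, and in practice the exact-match check is often replaced by a judge LLM.

```python
# A tiny eval harness, rerun after every prompt/model/temperature change,
# the same way you would rerun a CI test suite.
EVAL_CASES = [  # illustrative golden cases
    {"input": "I love this phone", "expected": "positive"},
    {"input": "Worst purchase ever", "expected": "negative"},
    {"input": "It arrived on Tuesday", "expected": "neutral"},
]

def run_evals(classify, threshold: float = 0.95) -> bool:
    """`classify` is the LLM-backed function under test (e.g. the classify_sentiment sketch above)."""
    passed = sum(1 for case in EVAL_CASES
                 if classify(case["input"]).lower() == case["expected"])
    score = passed / len(EVAL_CASES)
    print(f"eval score: {score:.0%} ({passed}/{len(EVAL_CASES)})")
    return score >= threshold  # gate the release on the benchmark
```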
Monitoring costs is crucial too—FinOps must be part of Gen AI from day one. Whether using a hosted LLM or running one on your own GPUs, managing expenses is a significant challenge, especially as you scale.
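A minimal sketch of that kind of tracking, assuming the OpenAI Python SDK: every response reports its token usage, so a thin wrapper can keep a running estimate of spend. The price table mirrors the figures quoted earlier and is an assumption, not a live price list.

```python
from openai import OpenAI

client = OpenAI()
PRICES = {"gpt-4o": (2.50, 10.00)}  # (input, output) dollars per million tokens, as quoted above
spend = 0.0

def tracked_completion(model: str, messages: list[dict]) -> str:
    """Call the API and accumulate an estimate of what the call cost."""
    global spend
    resp = client.chat.completions.create(model=model, messages=messages)
    in_price, out_price = PRICES[model]
    spend += (resp.usage.prompt_tokens * in_price
              + resp.usage.completion_tokens * out_price) / 1_000_000
    return resp.choices[0].message.content
```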
The good news? You don't need to be an AI researcher to use LLMs. Off-the-shelf models, standardized APIs, and abundant tutorials have made it possible for traditional developers to build Gen AI applications.
The Future of Programming
However, traditional developers aren't used to the nature of Gen AI programming. Implementing a feature with Gen AI takes longer and is often frustrating. Improving one aspect, like speed, may negatively impact another, like cost or accuracy. Many features released to production end up breaking in ways developers couldn't predict.
Learning this new type of programming is challenging. Most tutorials cover simple use cases, and the real complexity emerges when you go beyond basic tasks. The first attempts are often fragile, and it can take months for traditional developers to reach a level where they can deliver quality Gen AI products.
And that’s just using existing LLMs. Fine-tuning or training your own model adds even more uncertainty and expense.
Meanwhile, LLMs themselves are getting better at programming. If you want 100 variations of a prompt, ask an LLM to generate them. Tools like GitHub Copilot are bridging the gap, acting like compilers that translate human intentions into code.
Conclusion
Who will program tomorrow's AI agents? We still need traditional developers to craft the user experiences around LLMs. For example, the value of OpenAI’s new Canvas isn’t just the LLM it uses; it's the thoughtful user experience—selecting text, choosing a brush, and seeing results in context.
Yet they must become augmented programmers to effectively tackle the new challenges of Gen AI, integrating LLMs into their workflow. We also need a new class of specialized AI programmers—those skilled at identifying corner cases and designing optimal architectures for AI agents. This requires adapting our educational systems: programming schools must begin teaching these new paradigms to equip the next generation for the unique challenges of Gen AI.
As for me, I love exploring these new paradigms, even though they require a tremendous learning investment. I have made LLMs an essential part of my toolkit. They open up so many new possibilities that I find myself imagining a new product idea almost every day. The ecosystem evolves rapidly, but I'm determined to keep up.
I believe human coders like me will continue to have a place, as long as we can provide more value to stakeholders than AI-powered tools alone.