AI in the era of Large Language Models

Incredible recent breakthroughs in Deep Learning have erased decades of failed promises and rehabilitated AI from a lost cause to one filled with hope and promise.

There’s an old joke among physicists that nuclear fusion is a technology that is 30 years away — and always will be. Until recently, the same could be said about Artificial Intelligence (AI), a seventy-year-old field with a dismal track record of over-promise and under-achievement.


In particular, advances in scaling Large Language Models (LLMs) — deep neural network models with billions of parameters trained on massive amounts of internet data — are bringing us closer to a future of machines with human intelligence. And as with most things in technology, it happened “gradually and then suddenly.”

LLMs are the breakthrough AI has been looking for arguably since the 1950s, when John McCarthy coined the phrase Artificial Intelligence to describe “the science and engineering of making intelligent machines.”

There is still much we don’t understand about LLMs: how they work, the nature of their intelligence, and how close they can get us to Artificial General Intelligence (AGI). But we know LLMs have two remarkable capabilities that make them qualitatively different from any previous AI.

Large Language Models are “Meta-learners”

Large Language Models are “Meta-learners,” which is to say they have learned how to learn. The OpenAI paper that introduced GPT-3 (a pioneering Large Language Model) describes this ability as follows:

[GPT-3] develops a broad set of skills and pattern recognition abilities at training time and then uses those abilities at inference time to rapidly adapt to or recognize the desired task.

Sometimes the model just needs an English description of the task (aka zero-shot learning). Other times, the model needs a few examples and can reason by analogy (aka few-shot learning).
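The distinction is easiest to see in the prompts themselves. Below is a minimal sketch, loosely modeled on the English-to-French translation example in the GPT-3 paper; the exact prompt formatting is an assumption, and no model is actually called here:

```python
def zero_shot_prompt(task: str, query: str) -> str:
    """Zero-shot: describe the task in plain English, then ask."""
    return f"{task}\n\n{query} ->"

def few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Few-shot: same task description, plus worked examples the model
    can complete by analogy."""
    shots = "\n".join(f"{x} -> {y}" for x, y in examples)
    return f"{task}\n\n{shots}\n{query} ->"

task = "Translate English to French."
print(zero_shot_prompt(task, "cheese"))
print(few_shot_prompt(task, [("sea otter", "loutre de mer"), ("cheese", "fromage")], "hat"))
```

The only difference between the two regimes is what the prompt contains; the model's weights are untouched in both cases, which is what makes this "learning at inference time" so remarkable.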

Large Language Models Show Emergent Intelligence

As we scale the size of these Large Language Models and train them on more data, something very surprising happens — the model (above certain scale thresholds) spontaneously develops new intelligence and capabilities not present in smaller versions of the model.

In a 2022 paper titled “Emergent Abilities of Large Language Models,” researchers from Google, DeepMind, and Stanford showed that when a Large Language Model goes over a size threshold, it can perform new tasks — such as adding 8-digit numbers, unscrambling words, and more.
In one test, the authors evaluated models on a battery of 57 tests covering a range of topics, including math, history, and law. They found that while smaller models struggled with these questions, once a model reached a certain scale, performance on these questions dramatically improved.

Scaling, For Now, is All We Need to Make AI Smarter

Given that scaling Large Language Models improves performance, it’s natural to wonder if there are limits or diminishing returns to scaling.
While nothing can scale forever, there is good empirical evidence that the current regime of scaling laws will continue for a while—so for now, we should expect smooth, predictable improvements as we apply greater and greater computing capacity to train these models.

This has led to a big push toward scaling, with organizations using massive supercomputers to build increasingly capable AI models. By some estimates, in the large-scale model era, the compute used for training is doubling every 9–10 months!
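A quick back-of-the-envelope calculation shows what that doubling rate implies per year (the 9.5-month figure below is simply the midpoint of the 9–10 month estimate above, not an additional data point):

```python
# Illustrative arithmetic, not a measurement: if training compute doubles
# every 9.5 months, the implied growth factor per year is 2**(12 / 9.5).
doubling_months = 9.5  # midpoint of the 9-10 month estimate

yearly_growth = 2 ** (12 / doubling_months)
print(f"~{yearly_growth:.1f}x more training compute per year")
```

That works out to roughly 2.4x more training compute per year — a pace even faster than classic Moore's-law doubling.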

While there is debate about how much of the computing budget should go towards building bigger models versus training them on larger data sets, there is broad consensus that we are in a golden era of scaling — a regime where scaling is not everything; it’s the only thing.

Large Language Models are Still Far from Perfect

The authors of GPT-3 noted the model sometimes generates text samples that “repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs.”

There are many other examples where LLMs can look very smart on one set of questions and then fail at the most basic ones.

The author Steven Johnson puts it eloquently in a recent NY Times piece:

If you spend enough time with GPT-3, conjuring new prompts to explore its capabilities and its failings, you end up feeling as if you are interacting with a kind of child prodigy whose brilliance is shadowed by some obvious limitations: capable of astonishing leaps of inference; possessing deep domain expertise in a vast range of fields, but shockingly clueless about many basic facts; prone to strange, senseless digressions; unencumbered by etiquette and social norms.

Paul Graham, identifying the same phenomenon, makes a broader point about how surprising this trend in AI is, and goes on to elaborate on some of its implications:

I imagined AI would start by talking about the blocks world and eventually progress to writing essays. Instead, we have essays from day one, but bogus ones and AI progress by making them increasingly plausible.

One strange consequence of this trajectory may be that early AI will be useful, among other things, for bogus tasks. E.g., that it will be good at generating the sort of text we’d classify as boilerplate: undergrad papers, code in less abstract languages, etc.

Putting all this together, Large Language Models seem to satisfy the initial condition for a classic low-end technology disruption play. These models are good enough for some tasks under careful human supervision (content creation, code generation, design, search, etc.) but a bad fit for high-end tasks, where a poor prediction can cost money (fraud detection) or worse, lead to loss of life (self-driving cars).

As a result, we see AI today bifurcating into two segments: Specialized AI (Classic Big Data ML for task-specific, custom models) and General AI powered by LLMs (or, more broadly, Foundation models).

As LLMs get better through scaling, we should expect that for many (but not all) tasks, the price/performance tradeoffs will tilt in favor of General AI over Specialized AI, especially since adapting an LLM (through prompting or fine-tuning) is cheaper than building a specialized model from scratch.

Going further:

  1. LLMs will increase the addressable market for AI applications. Historically, the high cost of building an AI-powered App limited the market to only the wealthiest companies (Google, Facebook, Amazon, etc.). LLMs level the playing field by making AI accessible to a much wider range of businesses. And with the growth and rising popularity of open-source LLM models (Stable Diffusion, BLOOM), it seems like these enabling technologies will soon be available to all.
  2. LLMs solve the dreaded data bootstrapping problem: how do you build good AI into an App without users or data? With LLMs, the out-of-box AI may be good enough to get an App early users, and as the App grows its user base and accumulates more data, the stock LLM model can be fine-tuned to further improve the App, creating a data moat that increases with scale.
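The bootstrapping loop in point 2 can be sketched concretely. This is a minimal illustration only: the app data, field names, and prompt/completion JSONL format below are assumptions modeled on common fine-tuning pipelines, not any specific vendor's API.

```python
import json

# Hypothetical accumulated app data: user requests paired with the
# responses users ultimately accepted. Field names are illustrative.
app_history = [
    {"request": "Summarize this ticket",
     "accepted_response": "Customer reports login failures since Tuesday."},
    {"request": "Draft a follow-up email",
     "accepted_response": "Hi Sam, following up on our call..."},
]

# Convert the history into prompt/completion pairs, one JSON object per
# line (JSONL) - a format commonly used for fine-tuning language models.
jsonl_lines = [
    json.dumps({"prompt": ex["request"], "completion": ex["accepted_response"]})
    for ex in app_history
]

print("\n".join(jsonl_lines))
```

The key point is the flywheel: the stock model gets the App its first users, their usage produces data like `app_history`, and fine-tuning on that data makes the App better than anything a competitor can build from the stock model alone.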

LLMs have the potential to touch every existing application category and create whole new ones. We’re early in the cycle, but we see a couple of promising themes where LLMs can have an immediate impact.

Reimagining Human-Computer Interaction

We built computers to make our lives easier, but we had to teach ourselves how to program them and train users how to use them. LLMs reset this dynamic so we can finally have Apps with native natural-language interfaces.

At its most basic level, this could be just a fancier search box. As a next step, we can imagine reworking the App interface to a prompt-based conversation window that becomes the starting point for all interactions. For example, a user can ask, “give me all invoices that are more than 30 days late; sort by amount and color all repeat offenders in red.”
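To make that concrete, here is a minimal sketch of the pattern: the LLM's only job is to translate the user's sentence into a structured plan, which ordinary application code then executes. The plan schema, field names, and invoice data below are all illustrative assumptions.

```python
from datetime import date

# Hypothetical invoice records; in a real App these would come from a database.
invoices = [
    {"id": 1, "customer": "Acme", "amount": 900, "due": date(2024, 1, 5), "prior_late": True},
    {"id": 2, "customer": "Globex", "amount": 1500, "due": date(2024, 2, 20), "prior_late": False},
    {"id": 3, "customer": "Initech", "amount": 400, "due": date(2024, 1, 1), "prior_late": True},
]

def run_query(plan, today):
    """Execute a structured plan of the kind an LLM might emit for the
    request above. The plan schema is an assumption for illustration."""
    late = [inv for inv in invoices if (today - inv["due"]).days > plan["min_days_late"]]
    late.sort(key=lambda inv: inv[plan["sort_by"]], reverse=True)
    for inv in late:
        inv["color"] = "red" if inv["prior_late"] else "default"
    return late

# In practice, an LLM would translate "give me all invoices that are more
# than 30 days late; sort by amount and color all repeat offenders in red"
# into a structured plan like this:
plan = {"min_days_late": 30, "sort_by": "amount", "flag": "repeat_offenders"}
results = run_query(plan, today=date(2024, 3, 1))
```

Keeping the LLM at the translation layer, rather than letting it touch the data directly, is what makes this pattern practical today despite the models' imperfections.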

At the limit, this interface can plug deep into the application, orchestrating workflows, routing, executing business logic, triggering actions, and getting “work done.” As this layer cuts deeper and deeper into the guts of the App, the potential for disruption of incumbent software players will only increase.

Business Intelligence is an obvious candidate for such a disruption; LLMs can enable a deep, interactive, collaborative dialog between user and machine, dramatically improving both the search and discovery of important business insights.
Other categories may also be up for grabs, including Search, which could be vertically focused and powered by generative AI. For other categories, LLMs may end up being a sustaining rather than a disruptive innovation. For instance, the winner in an LLM-enhanced CRM may end up being Salesforce itself.

Scaling Human Work

GitHub Copilot, an automated code-writing tool, has been a runaway hit in a short period of time. According to GitHub CEO Thomas Dohmke, Copilot is now handling up to 40% of coding among programmers who use it. Not only do developers love Copilot, but it is also making them more productive. According to one estimate, developers using GitHub Copilot completed their tasks 55% faster than those who didn't.

The copilot model can be applied elsewhere, from scaling creative work (Midjourney, Stable Diffusion, etc.), to scaling content creation (Jasper), to scaling legal work, and many other areas. We can also imagine many enterprise SaaS Apps with a built-in copilot to make App users more productive. Some categories already in the business of scaling human work (such as RPA) could be entirely reimagined, with AI-powered agents replacing brittle, inflexible task bots.

It’s a cliche to say that all breakthrough technologies are universally ridiculed at inception. This is, after all, the story of everything from the invention of the light bulb to the telephone to airplanes and, of course, computers.

Breakthrough technologies test our powers of extrapolation. It’s hard to picture how a promising but imperfect technology can grow to become a million times better.

Equally, they test our imagination. What could we do with such a magical piece of technology?

Even visionary, forward-thinking companies commercializing these promising technologies seem to underestimate them. In 1943 Thomas Watson, chairman of IBM, said, “I think there’s a world market for maybe five computers.”

More than three decades later, in 1977 (with Moore’s law in full swing), Ken Olsen, the founder of Digital Equipment Corporation, said, “There is no reason for any individual to have a computer in his home.”

Until recently, Artificial Intelligence provided a curious counterpoint to this narrative, a field with no shortage of grandiose claims and hype but a track record of chronic underachievement.

AI got off to a promising start in the 1950s, with early proof-of-concept programs reasoning with logic and theorems. Herbert Simon, a Nobel Prize winner and early AI pioneer, boldly predicted in 1965 that “machines will be capable, within twenty years, of doing any work a man can do.” Marvin Minsky, another towering figure of the field, predicted in 1967 that “within a generation, the problem of creating artificial intelligence will substantially be solved.”

None of these predictions came to pass. While there was no shortage of imagination, it turns out there was no roadmap to scale AI and make it smarter.

Large Language Models are giving AI the roadmap it needs. AI today bears little resemblance to previous AI hype cycles. In fact, with scaling laws and rapid progress, a better comparison to AI may well be the Microprocessor revolution. And if that comparison holds up, history tells us no matter what we think of AI’s potential today, it will only end up exceeding our expectations.

Ramu Arunachalam, General Partner
Additional Reading