On Large Language Models

I’ve been pretty quiet about ChatGPT and Bing for a number of reasons, the most pertinent of which is that I have so much more going on in my life right now.

But I think it’s time to jot down some notes on how I feel about Large Language Models (henceforth abbreviated to LLMs) and the current hype around them.

And I’m going to try to do that from the perspective of someone who:

Graduated from college soon after the peak of the 90’s AI Winter (yeah, I’m old–we call it “experience” these days)
Actually decided not to major in AI (but rather in more networking-focused topics) because of said Winter, although I went and racked up my point average by acing AI coursework as optional credits.
Survived several hype cycles over the past 30 years.
Dove into analytics and data science during the “resurgence” in 2012 and enjoyed it immensely (as well as racking up a few ML certifications) before getting sucked into telco again.
Spends an unhealthy amount of time reading papers and mulling things.

Plus the field is evolving so quickly that I’ve drafted this around four times–all the while progressively shrinking it it down to a quick tour over what I think are the key things to ponder.

How Smart is an LLM, anyway?

I’m going to start with an obvious fact, which is that LLMs just seem to be smart. Sometimes recklessly so.

Yes, typical outputs are vastly better than Markov chains, and there is a tendency to draw a rough parallel with running the probabilities for the next token through the LLM.

Like people like Tim Bray have pointed out, that is seriously underestimating the complexity of what is represented in model weights.

The reason why the Markov analogy breaks down is that LLM output is not probabilistic–there is randomness involved in setting up inference, sure, and sequential correlation between output tokens, but the factors driving the output are several dozens of orders of magnitude above what we were used to.

Random outcomes like the LLM starting to hallucinate are just par for the course of a neural network trying to go beyond the training data, or focusing attention on parts that lack enough conditioning to have a decent output.

But going back to the initial point, there is zero “knowledge” or intelligence in an LLM. There are impressive amounts of correlation, to be sure, but the core principle harks back to the first AI Winter–it’s just that we’ve crossed a quality threshold that seemed hitherto unattainable.

It may look like emergent behavior, but that is simply because we can’t trace every step that led to the output. There is no agency, nor real “understanding”.

And, as anyone who’s read Douglas Hofstadter will point out, there is also no “strange loop” or a coherent capability to self-reference–the outputs are just the result of navigating an LLM’s internal representation of massive amounts of data, and they’re entirely functional in more than one sense of the word.

Things Are Just Getting Started

Shoving all those orders of magnitude into something that can fit into an enterprise-class GPU (or, increasingly, a GPU and a hefty set of NVMe drives) takes quite a toll, and training LLMs requires massive computational power that is (for the moment) outside an individual’s reach.

But that is certain to change over time, and inference is already possible on consumer-grade hardware–like this past couple of weeks’ spate of news around llama.cpp proves, there is a lot of low hanging fruit where it regards optimizing running the models, and at multiple levels¹.

Although things like weight quantization degrade the output quality quite a bit, I expect more techniques to pop up as more eyes go over the papers and code that are already out there and spot more gaps and tricks to run LLMs efficiently.

And despite the fact that the spotlight is on OpenAI and the massive cloud infrastructure required, I personally find it a lot more interesting to figure out how low LLMs can go and still produce coherent results.

This because I have fairly high hopes for tailored models, and see a lot of value in having fully on-premises and even embedded solutions–I know I’m bucking the trend here, but the history of computing is one of decentralization, and you’re probably reading this on a smartphone… So my point should be obvious.

What Are LLMs Good For?

Having spent entirely too long dealing with customer support and call centers (I actually find the generic “chatbot” thing extremely annoying, and resisted getting into building those, but such is life), I’d say that, at the very least, LLMs are certain to take virtual assistants and support chatbots to the next level.

And no, this is not a new idea–it’s been hashed to death over the years, and the real problem is that most support knowledge bases are useless, even if you manually tag every snippet of information and carefully craft interaction flows. Traditional chatbots (and even summarization-driven ones) simply suck at doing the kind of basic correlation even a script-driven, barely trained human can pull off on autopilot, and hacking them together was always a brittle and unrewarding endeavor.

But an LLM is trained on other content as a baseline, which gives it a much better ability to fill in the gaps in such knowledge bases, and certainly have better conversational skills than a goldfish–and I can see LLMs doing a decent job in highly patterned, formalized inputs like legal documents, medical reports, retail catalogues, etc.

How Reliable Are These Things?

To be honest, right now, not that much. I wouldn’t rely on any publicly available LLM for decision-making of any kind (coding, advice, or even accurate summarization), although every iteration improves things noticeably.

Sure, some of the humor and “style transfer” is pretty hilarious, but LLMs still have trouble with basic math, let alone writing reliable code²–they’re not even that useful at “rubber ducking” a problem.

Outputs are generally shallow and LLMs still have trouble creating coherent long form without hallucinating, but I do think they can be useful as baselines for a human to improve upon, as long as that person has a good enough grasp of the problem domain to spot obvious flaws in “reasoning” (not just incorrections, but also gaps) and the willingness to double check any references.

Of course, any of those sanity checks seem absent from a lot of the hype-driven discussions I’m seeing online… But, more to the point, LLMs do seem to knock things out of the park for short interactions.

Which is why I think the search market disruption gambit is going to pay off handsomely–LLMs make for a much better search experience because you get adjacent information you would otherwise be unable to get from either direct or statistical matches (and you don’t get pesky ads, keyword squatters, etc.)

How Manageable Are These Things?

This is where I have the most doubts, to be honest.

The current “programming paradigm” is hopelessly primitive, and all the early deployment shenanigans prove it–prompt stealing and prompt injection attacks (which can be much more interesting than you’d expect) remind me of all the loopholes Asimov managed to squeeze out of The Three Laws of Robotics.

Plus the ease with which the models “hallucinate” and veer off into the wild blue yonder were, until recently, being dealt with by ham-fisted tactics like limiting the number of consecutive interactions with the model.

In short, it all feels… very Sorceror’s Apprentice, to be honest.

And I don’t think “stacking” models or just creating embeddings is going to help here–long-term curation of model inputs is going to be key.

Which means time-consuming, costly, and ever more challenging work to improve general purpose LLMs, especially those targeting search (where having non-AI generated training sets is going to be harder and harder).

Fast Iteration, But What About Fast Training?

Another important constraint that is being glossed over is that there is no easy, immediate feedback loop to improve an LLM–in the current chat-like interaction models you can add more context to a session, but:

It doesn’t really “stick”–sometimes not even subsequent invocations (even if the session wrappers are continuously improving, you’re effectively adding stubs to the original prompt, and that can only go so far).
Any on-the-fly corrections don’t become part of the core model (you need to have a full training iteration).

These things can be worked around, but are fundamental limitations–and yet, they don’t have any real consequence for simple one-shot tasks like “summarize this webpage” and most of the “productivity boosters” we’re likely to see over the coming months.

But they do compound my notion that LLMs feel more like an impressive party trick than a broadly sweeping change in paradigm–at least for now. Their real impact lies elsewhere, and most likely beyond the obvious chatbot scenarios.

It would be nice to take away a lot of the drudgery we’ve baked into computer use (as well as several typical knowledge worker tasks), although there are interesting (and risky) implications in empowering certain kinds of people to mass-produce content³…

Conclusion

So where does this leave us?

Well, we’re clearly in the upward swing of the hype cycle. And, like I pointed out at the start of this piece, I’ve been there before–the quick iteration, the optimizations, the unexpected new techniques in established domains, and the fallout (both good and bad). Those parts are not hard to predict.

The big difference this time is that for users, the barrier to entry is effectively nil, and, again, the outputs are way better (and more impressive) than anything else we’ve seen before. Even if it’s still just a more elaborate Chinese Room, there is a lot more public interest and momentum than is usual in most tech hype cycles.

So yes, this one is going to be a bumpy ride, and not just for geeks. Make sure you have your metaphorical seat belt on tight.

And while I was revising this Pytorch 2 came out, with a nearly 50% performance boost for image models–I’m just waiting for xformers to fall in line to upgrade my Stable Diffusion setup… ↩︎
I routinely try to get LLMs to, say, invert a heap, or even to compose SQL queries (which I hate doing), and the results are always abysmal. I can’t even imagine how badly they would fare in medicine or law. ↩︎
And I don’t mean political parties or nation states here. The prospect of mass-produced A.I.-accelerated reports, presentations, memos, etc. should be enough to give any corporate knowledge worker pause.. ↩︎

Tao of Mac