This is a talk I gave at the Actuaries Institute (AI!) Convention on 7th November 2024. In a few places I glossed over things too quickly, so I’ve filled in a bit of the conversation to make it make more sense.
I haven't done a careful study of success factors. What I’ve got is a list of observations that I've made based on the projects that I've done this year and last year.
For some reason, vendors of help desk software seem to think I know something, so there are a few of those in the list, along with a few telephony applications, an insurance claim handling system, project Symmachus, and a few others.
By the way, if you want to be involved in a research project where we analyze this more scientifically, please get in touch with me and I'll see what I can organize.
Model training is dead
This is the most controversial statement, but it’s either the top or second-top way I’ve seen AI projects fail.
Somewhere around the end of 2023, it became completely obvious that there was no way that any individual contributor or individual company could match the computational resources available to the majors (OpenAI, Anthropic, Google). Even just assembling a suitable data set is a project too large for almost anybody else.
These models have been trained on most of the content that's on the internet, so unless you have a few petabytes of proprietary data that isn’t on the internet, they already know everything that your model is likely to be able to learn.
If you're trying to train up a model from scratch, you almost definitely cannot compete.
Slightly more controversial is the idea that even fine-tuning existing models is not effective.
Fine-tuning a model or just doing some detailed prompt engineering is ultimately doing the same thing. It's an attempt to get answers to come out of a language model from a restricted subset of the total space.
I'm firmly in the "don't bother with fine-tuning" camp, but even those who disagree with me will acknowledge that the accuracy difference between fine-tuning and prompt engineering is not all that great.
So don’t bother training your own models.
There are a few exceptions to this.
If you're training up a linear model (or even a random forest) in order to understand what features affect the outcome that you care about, there will be a training process happening there. But the result is different. What you're trying to do there is get an explanation of behavior that exists in your data. The goal isn't perfect accuracy — the goal is insights in order to do something different.
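As a concrete (and entirely illustrative) sketch of that kind of explanatory modelling, here is roughly what it looks like with scikit-learn; the synthetic dataset stands in for whatever outcome and features you actually care about:

```python
# Training for explanation, not prediction: what matters is the coefficients
# and importances, not the accuracy score. The synthetic data is a stand-in
# for your own features and outcome.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

linear = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in zip(feature_names, linear.coef_[0]):
    print(f"{name}: {coef:+.3f}")       # sign and size hint at how the feature pushes the outcome

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")  # relative importance, not direction
```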
Another legitimate use for training large machine learning models is knowledge distillation. If you are a software engineer working on an embedded device (an Internet of Things device, a mobile phone, or some other small, low-power device), there's a technique where you can make a small language model that performs nearly as well as a large language model but has far fewer parameters. Those projects do succeed and can provide enormous amounts of value.
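The core idea of distillation is that the small student model is trained to match the large teacher's output distribution as well as the true labels. Here is a minimal sketch of the standard distillation loss in PyTorch; the temperature T, mixing weight alpha, and the random tensors are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of (a) matching the teacher's softened output distribution and
    (b) ordinary cross-entropy against the true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # rescale so gradients stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# During training, the frozen teacher scores each batch and the small student
# is optimised against this loss instead of the labels alone.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```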
But if you have a team of data scientists working with a rack of GPUs training up new models, they're almost definitely wasting time and wasting money.
I haven't seen a project like that succeed in at least two years.
To be clear, occasionally I see one that gets a result, but that result can be beaten by an intern working with the OpenAI API in an afternoon.
Track your context budget
So, given that almost all generative AI projects are going to be using a large language model to do most of the heavy lifting, what do you need to track if you are a project manager or product manager overseeing the development of this piece of software?
The language models we use at the moment all have a fixed context window. For OpenAI's GPT-4o models, it's 128,000 tokens; for Google's Gemini 1.5 Pro, it's 2 million tokens.
A token is roughly three quarters of a word.
So, roughly speaking, you can throw about 90,000 words at OpenAI and get a response. If you try to give it more than that, you either get an error or it silently ignores the beginning or the end.
All the techniques we use to handle large amounts of data with large language models are about managing that context window.
Normally, when we buy AI services from the major vendors, we pay a per-token fee. For example, GPT-4o mini costs US$0.15 per million input tokens.
If it's not per token, then it is priced in terms of throughput (the ability to process some number of tokens per second).
So the context budget also speaks to your monetary budget: how much the product you are building is going to cost to run.
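If you want to track this in practice, a small sketch along these lines is enough; it assumes the tiktoken library, the cl100k_base tokenizer used by GPT-4-era models, and the US$0.15-per-million-input-token price mentioned above:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by GPT-4-era models

def context_report(text: str,
                   context_window: int = 128_000,
                   usd_per_million_input_tokens: float = 0.15) -> None:
    # How much of the window does this text consume, and what does it cost per call?
    tokens = len(enc.encode(text))
    cost = tokens / 1_000_000 * usd_per_million_input_tokens
    print(f"{tokens:,} tokens "
          f"({tokens / context_window:.1%} of a {context_window:,}-token window), "
          f"roughly ${cost:.4f} of input per call")

# Illustrative document standing in for whatever you plan to feed the model.
context_report("Dear claims team, I am writing about policy 12345..." * 200)
```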
Decide on temperature, consistent or creative
It surprises me when people get this mixed up.
Temperature is a parameter that you can set when you are interacting with a large language model.
If you set the temperature to zero, then you will get the same answer every single time you interact with it. When it is trying to predict the next word, it will always choose the most likely next word.
There are many applications where repeatability is extremely important. If you want to build a reliable test suite, controlling the temperature is very useful: it lets you know whether something of importance has genuinely changed or whether you have just hit a random anomaly.
For something like an insurance claim adjudication, you would almost definitely want to have the temperature set to zero.
But if you set the temperature to zero, you will not get creativity. Creativity from a large language model comes when the temperature is above zero, which means that very occasionally instead of choosing the most likely next word it will choose a less likely word.
And then having output that word, it then has to carry on the sentence and the paragraph and the rest of the document. And, as a result of that one word change, it may have to construct a very different output. This is what we mean when we talk about a language model being creative. It's producing something unexpected, unusual, and new.
(There's an interesting aside which asks whether or not human beings are doing anything different to that when they are being creative. Opinions differ.)
Now, if you are wanting to make an IT help desk bot, then being creative can be quite good. If the user has already tried everything, then having a deterministic tool that will just reply with the same thing is not very helpful. Sometimes it needs a flash of inspiration or something unusual in order to find an answer that solves a user's problem.
It’s not always easy to decide what you should do.
When I was translating the papyri and shards and other inscriptions for the Symmachus project, I set the temperature to zero because I wanted something reproducible. But in retrospect, I wonder whether I should have been more creative to make some of the translations come a little bit more alive.
But you have to decide one way or the other.
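Whichever way you decide, it's a single parameter on the API call. A minimal sketch with the OpenAI Python SDK; the model name and prompts are illustrative:

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,        # repeatable: always take the most likely next token
    # temperature=0.8,    # creative: occasionally take a less likely token
    messages=[
        {"role": "system",
         "content": "You adjudicate insurance claims strictly against the policy wording supplied."},
        {"role": "user",
         "content": "Policy wording: ...\n\nClaim: ..."},
    ],
)
print(response.choices[0].message.content)
```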
Create train tracks to avoid choices
Because I flit between Macquarie University in Sydney and ANU in Canberra, I spend a lot of time on the Sydney to Canberra XPT service. I'd like to say I have a love-hate relationship with it and with New South Wales TrainLink, but it's probably more of a hate relationship.
Any of you who have been on the 45-year-old XPT will realise it was state of the art in the 1980s but has long since fallen behind. But no matter how much I denigrate the slowness of getting from home to Canberra in a bit over five hours (when I can drive it in three), I will admit that the train has never gotten lost.
I have never had the train driver decide to turn right at Goulburn instead of left. However, when I drive, particularly if I'm driving from the city, I have such a terrible sense of direction that I will often make one wrong turn early on and find myself heading north up the M1 towards Brisbane.
I'm good at AI and maths and languages and computing. I've long since come to terms with not being very good at navigating. So let me prognosticate on the future of fun programming.
One of the most addictive and enjoyable styles of programming is to provide a language model with a list of tools. For example, if you have a database that needs to be searched, then you tell the language model that it is allowed to request a database search, describe what that search function can do and what parameters it needs and then you write some code that implements that search.
And that's it! You don't need to write a user interface, you don't need to write any logic. You just ask the LLM to navigate its way through your database to find whatever information you want. And generally, it can do it.
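A minimal sketch of what that looks like with the OpenAI Python SDK; the search_policies tool, its parameters, and the example query are all made up for illustration:

```python
from openai import OpenAI

client = OpenAI()

# Describe the one tool the model is allowed to call: a database search.
# The function name and parameters are illustrative; you still implement
# the actual search yourself.
tools = [{
    "type": "function",
    "function": {
        "name": "search_policies",
        "description": "Search the policy database and return matching records.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search terms"},
            },
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Find policies mentioning flood cover in NSW"}],
    tools=tools,
)

# If the model decides the tool is needed, it returns a structured call that
# you execute and feed back in a follow-up message.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```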
The danger is that you keep giving it more and more tools. It becomes more and more powerful.
This is like giving the LLM a car that it can drive around in, with lots of intersections. It's a great way of prototyping an application in the initial stages, when you have a less clear idea of how everything fits together.
It lets you use the LLM as a very high-speed data analyst and business analyst.
But then, those tools end up closer and closer together in a dense little subspace of language concepts. The more you have, the more likely it is that the LLM will pick the wrong one.
It's like my driving: the LLM has a tendency to make a small wrong turn early on that it then doesn't know how to recover from.
So eventually, you need to make sure it has fewer options and longer sequences of instructions or prompts so that it can be reliable for bigger applications.
The awesome way to do this is to have some kind of feedback mechanism where the LLM can learn “the last time I was given this prompt, I called functions X and Y; X was marked as helpful, and Y was marked as unhelpful.”
The end result should be a long sequence of prompts, where at each prompt it has very few options. That's what I mean by train tracks.
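To make the idea concrete, here is a small, purely illustrative sketch of that shape: a fixed sequence of prompts where each step sees only the tools it needs, with a placeholder standing in for the actual model call:

```python
# A "train tracks" sketch: instead of one prompt with every tool available,
# the work is a fixed sequence of prompts, each with a small, stage-specific
# tool list. call_llm() is a placeholder for however you invoke the model.
from typing import Callable

STAGES = [
    {"prompt": "Classify this help desk ticket as billing, outage or password reset.",
     "tools": []},
    {"prompt": "Search the knowledge base for articles relevant to the classification.",
     "tools": ["search_kb"]},
    {"prompt": "Draft a reply to the user citing the most relevant article.",
     "tools": ["send_email"]},
]

def run_pipeline(ticket_text: str, call_llm: Callable[[str, str, list], str]) -> str:
    context = ticket_text
    for stage in STAGES:
        # Each call sees only the tools for its stage, so a wrong turn at one
        # step can't send the whole journey off towards Brisbane.
        context = call_llm(stage["prompt"], context, stage["tools"])
    return context

# Placeholder model call so the sketch runs end to end.
if __name__ == "__main__":
    print(run_pipeline("My internet has been down since Tuesday.",
                       lambda prompt, ctx, tools: f"[{prompt}] applied to: {ctx}"))
```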
Expose the prompt
So you have your application, with a large number of prompts that need to be delivered to the LLM, and some tools and functions so that the LLM can perform whatever task you ask of it.
There will be tools in there for different kinds of integrations. Perhaps you might have a tool for sending an email, a tool for raising a ticket in some internal task management system, a tool for interfacing with this application and a tool for interfacing with another application.
The least effective projects I've seen are the ones where the prompts are hard-coded into the software and can only be changed by programmers.
The most effective projects I've seen are the ones where the prompts are exposed to end users.
If end users have the ability to update the prompt, change the prompt, insert extra steps that get prompted, and have the power to create automations within the system just using natural language, exciting things happen.
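One simple, illustrative way to do this is to keep the prompt text in a file or database table that end users can edit, and have the application read it at run time rather than bake it into the code; the file name and keys below are made up:

```python
# A sketch of exposing prompts rather than hard-coding them: the prompt text
# lives in a file (or a database table) that end users can edit without a
# programmer.
import yaml

PROMPTS_FILE = "prompts.yaml"
# prompts.yaml might contain, for example:
#   triage: "Classify this ticket as billing, outage or password reset."
#   escalation: "Summarise the ticket and draft an email to the on-call engineer."

def load_prompts(path: str = PROMPTS_FILE) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def build_messages(step: str, user_text: str) -> list[dict]:
    prompts = load_prompts()   # re-read each time so user edits take effect immediately
    return [
        {"role": "system", "content": prompts[step]},
        {"role": "user", "content": user_text},
    ]
```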
Prompting Substitutes for Programming
I don't mean that all software engineering jobs are about to disappear and be replaced by people writing prompts. (Although it's mind-boggling how far people without much programming experience can now get with just a bit of prompting; here's a video of an eight-year-old girl on her second programming lesson. Skip to the end to see a UI colour scheme that only an eight-year-old could create.)
What I mean is that previously there were automations and integrations that could only happen by being programmed explicitly. Many of those now don't require the skills of a software engineer to do.
The dynamic of this is kind of interesting. If you are not able to program for yourself but you see some possibility for automating some part of your workflow, the process goes something like this:
You somehow realize that automating the task is possible. In many organizations, people never get beyond that point.
You somehow (without knowing anything about programming) identify that it would be cost-effective for the organization to have this task automated.
You submit a proposal to the high priesthood of programmers and try to persuade them that this is a worthwhile thing to do.
The high priesthood, manifested in the form of the product manager or the scrum master or whoever controls the Kanban board, then determines where to slot it in: e.g. "That'll get looked at in six months' time."
A product manager and some business analysts will then interview you, completely misunderstand the purpose of the integration or automation, and fail to understand the problem you are trying to solve. Those mangled instructions will make their way to the programmers, who will program it up while wondering why the instructions make no sense.
This is one of the reasons that AI projects in enterprises fail 80% of the time.
It is that long, complicated process that prompting substitutes for.
The end users themselves can explain, in plain English, what needs to be automated.
If they have access to the prompts, they can insert those prompts into the sequence that is presented to the LLM and do the automation or integration themselves.
This is powerful and it has an interesting implication.
AI innovation happens bottom-up
The number one observation I've seen that distinguishes successful projects from unsuccessful ones is whether there is an expectation that AI innovation is to be done by a central team or whether there is an expectation that AI innovation should happen at the individual contributor level.
The organizations that have empowered individuals to explore ways that they can use AI to achieve outcomes reap the benefits very, very quickly.
The organizations that have a centralized gatekeeping process are the slowest to innovate.
And if there's one key message to take away, it is that the pace of AI innovation is fast.
We have seen jobs and industries transformed in 12 months, just from the capabilities that AI has at the moment. This will only accelerate.
There isn't time to be inefficient with AI projects.
There isn’t time to hobble projects with shackles that we know cause problems.
There isn’t time to ignore the evidence of best-in-class projects.