
    Forget training, find your killer apps during AI inference

    By TechAiVerse, October 10, 2025

    Most organisations will never train their own AI models. Instead, for most customers the key challenge in AI lies in applying it to production applications and inference, with fine-tuning and data curation as the core tasks.

    Key here are the use of retrieval-augmented generation (RAG) and vector databases, the ability to reuse AI prompts, and co-pilot capabilities that allow users to query corporate information in natural language.

    Those are the views of Pure Storage execs who spoke to Computerweekly.com this week at the company’s Accelerate event in London.

    Naturally, the key tasks identified fit well with functionality recently added to Pure’s storage hardware offering – including its newly launched Key Value Accelerator – and with its ability to provide capacity on demand.

    But they also illustrate the key challenges for organisations tackling AI at this stage in its maturity, which has been called a “post-training phase”.

    In this article, we look at what customers need from storage when AI is in production, with ongoing data ingestion and inference taking place.

    Don’t buy GPUs; they’re changing too quickly

    Most organisations won’t train their own AI models because it’s simply too expensive at the moment. GPU hardware is incredibly costly to buy, and it is evolving at such a rapid pace that obsolescence comes very quickly.

    So, most organisations now tend to buy GPU capacity in the cloud for training phases.

    It’s pointless trying to build in-house AI training farms when GPU hardware can become obsolete within a generation or two.

    That’s the view of Pure Storage founder and chief visionary officer John “Coz” Colgrove.

    “Most organisations say, ‘Oh, I want to buy this equipment, I’ll get five years of use out of it, and I’ll depreciate it over five or seven years,’” he said. “But you can’t do that with the GPUs right now.

    “I think when things improve at a fantastic rate, you’re better off leasing instead of buying. It’s just like buying a car,” said Colgrove. “If you’re going to keep it for six, seven, eight years or more, you buy it, but if you’re going to keep it for two years and change to a newer one, you lease it.”
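    As a rough illustration of that buy-versus-lease arithmetic, the sketch below compares straight-line ownership cost with on-demand rental. Every figure in it, including the two-year useful life, is an assumption for illustration rather than real pricing.

```python
# Rough buy-vs-lease comparison for GPU capacity. All figures are
# illustrative assumptions, not vendor or cloud pricing.

def buy_cost_per_year(purchase_price: float, useful_years: float) -> float:
    """Straight-line cost per year over the period the hardware stays competitive."""
    return purchase_price / useful_years

def rent_cost_per_year(hourly_rate: float, utilisation: float = 0.5,
                       hours_per_year: int = 8760) -> float:
    """Cloud rental cost for a year at a given average utilisation."""
    return hourly_rate * hours_per_year * utilisation

# Hypothetical numbers: a GPU server bought outright vs rented on demand.
own = buy_cost_per_year(purchase_price=250_000, useful_years=2)   # obsolete in ~2 years
rent = rent_cost_per_year(hourly_rate=12.0, utilisation=0.3)      # bursty training use

print(f"own: ${own:,.0f}/yr   rent: ${rent:,.0f}/yr")
# The shorter the useful life and the lower the utilisation, the more renting
# wins; at high sustained utilisation over many years, owning pulls ahead.
```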

    Find your AI killer app

    For most organisations, practical exploitation of AI won’t happen in the modelling phase. Instead, it will come where they can use it to build a killer app for their business.

    Colgrove gives the example of a bank. “With a bank we know the killer app is going to be something customer facing,” he said. “But how does AI work right now? I take all my data out of whatever databases I have for interacting with the customer. I suck it into some other system. I transform it like an old ETL batch process, spend weeks training on it and then I get a result.

    “That is never going to be the killer app,” said Colgrove. “The killer app will involve some kind of inferencing I can do. But that inferencing is going to have to be applied in the regular systems if it’s customer facing.

    “That means when you actually apply the AI to get value out of it, you’ll want to apply it to the data you already have, the things you’re already doing with your customers.”

    In other words, for most customers the challenges of AI lie in the production phase, and more precisely in the ability to rapidly curate and add data, and to run inference on it to fine-tune existing AI models. And then to do it all again when the next idea for improvement comes along.

    Pure Storage EMEA field chief technology officer Fred Lherault summed it up thus: “So it’s really about how do I connect models to my data? Which first of all means, have I done the right level of finding what my data is, curating my data, making it AI ready, and putting it into an architecture where it can be accessed by a model?”
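    In practice, “making data AI ready” usually starts with a chunk-and-embed pipeline. The sketch below is a minimal illustration of that step; the embed() function is a stand-in for whatever embedding model an organisation actually runs, and the chunk size is an assumption.

```python
# Minimal "make the data AI ready" sketch: chunk text documents, embed each
# chunk, and keep the vector next to its source text, ready to load into a
# vector database. embed() is a stand-in for a real embedding model.
from pathlib import Path

CHUNK_CHARS = 1000  # assumption: ~1 KB of text per chunk

def embed(text: str) -> list[float]:
    # Stand-in so the sketch runs; a real pipeline calls an embedding model
    # here and gets back a vector of hundreds or thousands of floats.
    return [float(ord(c)) for c in text[:8].ljust(8)]

def chunk(text: str, size: int = CHUNK_CHARS) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(doc_dir: str) -> list[dict]:
    """Return records ready to load into a vector database."""
    records = []
    for path in Path(doc_dir).glob("**/*.txt"):
        for i, piece in enumerate(chunk(path.read_text(errors="ignore"))):
            records.append({
                "id": f"{path.name}:{i}",
                "text": piece,
                "vector": embed(piece),  # the data gets "augmented with vectors"
            })
    return records
```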

    Key tech underpinnings of agile AI

    So, the inference phase has emerged as the key focus for most AI customers. Here, the challenge is to curate and manage data so that AI models can be built and iterated on during their production lifetime. That means customers connecting with their own data in an agile fashion.

    This means the use of technologies that include vector databases, RAG pipelines, co-pilot capability, and prompt caching and reuse.
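    The retrieval half of a RAG pipeline can be sketched just as simply. Building on the ingestion example above, the following ranks stored chunks against a question and folds the best matches into the prompt; the similarity measure, prompt wording and k=3 are illustrative choices, not any particular product’s behaviour.

```python
# Minimal RAG retrieval sketch over the records built in the ingestion example:
# rank chunks by cosine similarity to the question and prepend the best ones
# to the prompt that goes to the model.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(question_vector: list[float], records: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose vectors sit closest to the question vector."""
    return sorted(records, key=lambda r: cosine(question_vector, r["vector"]),
                  reverse=True)[:k]

def build_prompt(question: str, context: list[dict]) -> str:
    """Augment the user's question with the retrieved corporate data."""
    snippets = "\n\n".join(r["text"] for r in context)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{snippets}\n\nQuestion: {question}")
```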

    The key challenges for storage here are twofold. One is being able to connect to RAG data sources and vector databases. The other is being able to handle big jumps in storage capacity, and to reduce the need for them. The two are often connected.

    “An interesting thing happens when you put your data into vector databases,” said Lherault. “There’s some computation required, but then the data gets augmented with vectors that can then be searched. That’s the whole goal of the vector database, and that augmentation can sometimes result in a 10x amplification of data.

    “If you’ve got a terabyte of source data you want to use with an AI model, it means you’ll need a 10TB database to run it,” he said. “There’s all of that process that is new for many organisations when they want to use their data with AI models.”
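    A back-of-the-envelope calculation shows how that order of amplification can arise; the chunk size, embedding dimension and index overhead below are assumptions chosen purely for illustration.

```python
# Back-of-the-envelope check on the "10x amplification" figure. Chunk size,
# embedding dimension and index overhead are assumptions for illustration only.
chunk_bytes = 1_000        # ~1 KB of source text per chunk
dims = 1_536               # embedding dimension (assumed)
bytes_per_dim = 4          # float32 vectors
index_overhead = 1.5       # metadata and index structures (assumed)

vector_bytes = dims * bytes_per_dim * index_overhead
amplification = (chunk_bytes + vector_bytes) / chunk_bytes
print(f"~{amplification:.1f}x the source data")  # roughly 10x with these numbers
```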

    Deal with demands on storage capacity

    Such capacity jumps can also occur during tasks such as checkpointing, which can see huge volumes of data created as snapshot-like points to roll back to in AI processing.
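    To give a sense of scale, a rough estimate of a single checkpoint’s size, under common assumptions about precision and optimiser state and with a hypothetical model size, looks like this.

```python
# Rough size of one training checkpoint: weights plus optimiser state.
# Model size, precision and optimiser bytes are assumptions for illustration.
params = 70e9            # a 70-billion-parameter model (assumed)
bytes_weights = 2        # bf16 weights
bytes_optimiser = 12     # fp32 master copy plus Adam moments (common rule of thumb)

checkpoint_tb = params * (bytes_weights + bytes_optimiser) / 1e12
print(f"~{checkpoint_tb:.1f} TB per checkpoint")  # ~1 TB; frequent checkpoints add up
```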

    Pure aims to tackle these with its Evergreen as-a-service model, which allows customers to add capacity rapidly.

    The company also suggests ways to keep storage volumes from rising too rapidly, as well as ways to speed performance.

    Its recently introduced Key Value Accelerator allows customers to store AI prompts so they can be reused. Ordinarily, an LLM would access cached tokens representing previous responses, but GPU cache is limited, so answers often need to be recalculated. Pure’s KV Accelerator allows those tokens to be held in its storage in file or object format.
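    Conceptually, an external key-value cache of this kind can be pictured as below. This is not Pure’s implementation, just a sketch of the idea: key pre-computed values by a hash of the prompt prefix and hold them on shared storage so any GPU can reuse them instead of recomputing; the cache path is hypothetical.

```python
# Conceptual sketch only, not Pure's implementation: key pre-computed
# attention key-value tensors by a hash of the prompt prefix and hold them
# on shared file or object storage for reuse across GPUs.
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path("/mnt/kv-cache")  # hypothetical shared file/object mount

def cache_key(prompt_prefix: str) -> str:
    return hashlib.sha256(prompt_prefix.encode()).hexdigest()

def load_kv(prompt_prefix: str):
    """Return cached key-values if this prefix was already computed, else None."""
    path = CACHE_DIR / cache_key(prompt_prefix)
    return pickle.loads(path.read_bytes()) if path.exists() else None

def store_kv(prompt_prefix: str, kv_tensors) -> None:
    """Persist computed key-values so the next identical prompt skips the work."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / cache_key(prompt_prefix)).write_bytes(pickle.dumps(kv_tensors))
```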

    That can speed responses by up to 20x, said Lherault. “The more you start having users asking different questions, the faster you run out of cache,” he added. “If you’ve got two users asking the same question at the same time and do that on two GPUs, they both have to do the same computation. It’s not very efficient.

    “We’re allowing it to actually store those pre-computed key values on our storage so the next time someone asks a question that’s already been asked or requires the same token, if we’ve got it on our side, the GPU doesn’t need to do the computation,” said Lherault.

    “It helps to reduce the number of GPUs you need, but also on some complex questions that generate thousands of tokens, we’ve seen sometimes the answer coming 20 times faster.”
