More speed is what AI Agents need
I now believe that more speed is the most immediate lever we can pull to build better AI Agent applications.
Before plunging into the world of AI Agents, I used to rank a model’s reasoning capabilities as the most important factor in choosing the appropriate LLM for my tasks. Cost is also an important variable, but since I don’t (yet) have an AI product with lots of customers, I haven’t paid much attention to it. Speed was probably the least important of these 3 factors.
Until now.
Since I started developing my own AI Agent library, I have come to realise how important it is to have access to a model running on hardware that can output a high number of tokens/second. This realisation turned into conviction upon trying Llama 3 on Groq a few weeks ago. I even went as far as implementing a Groq Agent in my AI Agent library so that I can use it as part of my workflow.
In this post, I will briefly explain what Groq is. Then I will expand on what benefits fast inference unlocks for AI Agent systems.
What Is Groq?
Groq is a hardware company that was founded in 2016 by Jonathan Ross. Before founding Groq, Ross led the development of Google’s Tensor Processing Unit (TPU). During his time at Google, Ross saw first-hand how much larger the inference market would become compared to the market for training deep and large language models.
Groq is developing a new kind of chip called the Language Processing Unit (LPU). In this new age of LLMs, the LPU is designed not for training new models but for maximising their inference speed. Groq’s chip aims to overcome the compute density and memory bandwidth bottlenecks observed when running LLMs on traditional GPUs. It does so by clustering together lots of LPUs, which enables faster computation. Additionally, Groq doesn’t use external memory; the models are fully loaded into its chips’ on-die memory, which reduces the interruptions that traditional GPUs must contend with when accessing model parameters in external RAM.
To demonstrate the impressive capabilities of their hardware, Groq have created a chat interface as well as an API that enables users to interact with the best open source models they host.
I recently tried Llama 3 through both the web interface and the API and was blown away by the speed with which I was getting answers back. These experiences completely changed my views on the importance of speed among the criteria required to deliver amazing AI-based applications.
What faster inference unlocks
Higher reliability
One of the common issues when building AI Agent systems is their low reliability. This problem becomes even more acute in enterprise settings, where low reliability is a deal breaker.
We expect upcoming models with more powerful reasoning capabilities to help mitigate these reliability issues, but I doubt this will be sufficient for agentic systems due to their more complex nature.
At the moment, adding a reflection step enables AI Agent systems to (self-)correct their answers. However, this extra step, which could involve multiple exchanges between agents, can add significant latency to an application. As a result, this step is sometimes skipped in AI Agent systems, thereby negatively impacting reliability.
With hardware like Groq that can deliver breathtaking tokens/second, implementing a reflection step becomes a very attractive proposition because it won’t significantly impact the latency of an agentic system. As a result, this encourages developers of AI systems to build more elaborate reflection modules, which will lead to even more reliable systems.
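To make the reflection step concrete, here is a minimal sketch of a draft, critique, and revise loop. It is not the implementation from my library: the function name `answer_with_reflection`, the prompts, and the `APPROVED` convention are all assumptions made for illustration, and `llm` stands for any function that takes a prompt and returns text.

```python
from typing import Callable, Tuple

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def answer_with_reflection(task: str, llm: LLM, max_rounds: int = 3) -> Tuple[str, int]:
    """Draft an answer, then let the model critique and revise it.

    Returns the final answer and the number of LLM calls that were made.
    """
    calls = 0

    # First pass: produce a draft answer.
    answer = llm(f"Task: {task}\nAnswer:")
    calls += 1

    for _ in range(max_rounds):
        # Reflection: ask the model to review the current answer.
        critique = llm(
            f"Task: {task}\nProposed answer: {answer}\n"
            "Reply APPROVED if the answer is correct, otherwise explain the problem."
        )
        calls += 1

        # Assumed convention: the reviewer starts its reply with APPROVED when satisfied.
        if critique.strip().upper().startswith("APPROVED"):
            break

        # Revision: feed the critique back to get an improved answer.
        answer = llm(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nRevised answer:"
        )
        calls += 1

    return answer, calls
```

Each round of reflection costs up to two extra LLM calls on top of the initial draft, which is exactly why tokens/second matters so much here: the loop only stays practical if each call returns quickly.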
Better UX
Most agentic workflows are currently very slow. This is because unlike in traditional chat exchanges with LLMs like ChatGPT, where few messages are sent and received, AI Agent based systems involve a multitude of messages sent across multiple agents.
Currently, the latency for a call to a standard, non-accelerated LLM is measured in hundreds of milliseconds or even seconds. Since dozens (and sometimes more) of messages are exchanged in AI Agent based systems, it can take several minutes to get an output from the system.
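A quick back-of-the-envelope calculation shows how sequential calls add up. The sketch below assumes the calls run one after another and that generation time dominates; the function name and all the numbers (30 calls, 300 output tokens per call, 30 versus 300 tokens/second) are illustrative assumptions, not measurements.

```python
def workflow_latency_seconds(
    num_calls: int,
    tokens_per_call: int,
    tokens_per_second: float,
    overhead_per_call: float = 0.0,
) -> float:
    """Estimate the end-to-end latency of a chain of sequential LLM calls.

    overhead_per_call covers network round trips and queueing (assumed constant).
    """
    per_call = tokens_per_call / tokens_per_second + overhead_per_call
    return num_calls * per_call


# Illustrative numbers only: 30 agent messages of 300 output tokens each.
print(workflow_latency_seconds(30, 300, tokens_per_second=30))   # 300.0 seconds, i.e. 5 minutes
print(workflow_latency_seconds(30, 300, tokens_per_second=300))  # 30.0 seconds
```

Under these assumptions, a tenfold increase in tokens/second turns a five-minute wait into a thirty-second one without changing anything else in the workflow.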
This is not a great user experience.
According to research from the usability experts of the Nielsen Norman Group, users implicitly anchor to 3 response time limits:
0.1 seconds or below: this gives the feeling of instantaneous response. It makes the user feel in control of the interaction.
1 second or below: this keeps the user’s flow seamless. They notice the delay but remain reassured that they are in control of the interaction.
10 seconds or less: this keeps the user’s attention. While they feel at the mercy of the machine in this case, they are still optimistic the task will complete.
In many usability studies, a delay over 10 seconds could mean a user leaving a website or abandoning the interaction with a system. And even when users stick around for more than 10 seconds, they may have trouble understanding what is going on.
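The three limits above are easy to turn into a small helper for checking where a measured response time lands. The function name and tier labels below are my own shorthand rather than Nielsen Norman Group terminology.

```python
def response_time_tier(seconds: float) -> str:
    """Map a response time to the three limits listed above."""
    if seconds <= 0.1:
        return "instantaneous"      # user feels in direct control
    if seconds <= 1.0:
        return "seamless flow"      # delay is noticed, flow is kept
    if seconds <= 10.0:
        return "attention kept"     # user waits but stays engaged
    return "attention at risk"      # user may abandon the interaction


for latency in (0.05, 0.8, 6.0, 45.0):
    print(f"{latency:>5} s -> {response_time_tier(latency)}")
```

A helper like this can be dropped into a benchmark script to flag which agent workflows are likely to lose the user's attention.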
In a nutshell - speed matters to users.
With faster inference, a dozen AI Agent calls could complete in roughly the time a single call takes today, thereby reducing the overall latency of the system. Workloads that previously had to be presented as asynchronous to manage users’ expectations (and patience) could now be completed synchronously in a timely fashion. This would provide a much more natural, seamless user experience.
New use cases
The truth is, more tokens/second unlock so many more use cases that we haven’t yet fully envisioned.
For instance, OpenAI recently released their new flagship model called GPT-4o. While this model retains the “GPT-4” prefix, it is different in nature from its predecessor model because it is natively multimodal.
GPT-4o was trained on text, images, and audio simultaneously. In contrast, the current GPT-4 was trained on text; so when you ask it on ChatGPT to analyse an input image, it delegates to another model trained specifically on images. GPT-4o does all of this from the same model, without delegation to another one. Of course, there are more optimisations and other tricks that the clever folks at OpenAI incorporated in this model too.
The result of all this good and hard work is a model that is 50% faster than GPT-4-turbo. Additionally, this native multimodality allows novel use cases like real-time audio translation, tutoring, and more low-latency interactions.
Closing thoughts
With the current pace of innovation on model reasoning capabilities, speed, and competitive price, it feels like we are inching ever closer to a world where interactions with AI Agents won’t be as slow, clumsy, and pricey as in our present reality.
Of those 3 core criteria, I believe speed is the one that has been downplayed for too long. But I am glad to see that more people (including me) are realising how critical it is to creating better AI Agent systems.
And just like Steve Jobs used to proclaim “1000 songs in your pocket” about the iPod, we may soon be able to state “1000 co-workers in the cloud” when it comes to AI Agents. In the near future we could even go as far as claiming “infinite intelligence in your pocket”, once we can run very powerful models at low latency on our smartphones and other devices.
This is a brave new world. I’m excited to be a part of it and contribute to it.
I recently launched Kiseki Labs, a consultancy helping businesses implement GenAI through workshops, strategic advisory, and custom solutions. If you're interested in working together, you can book a free consultation at kisekilabs.com or connect with me on LinkedIn.