Productionalizing AI agents

While building Brayniac, I wanted to share some challenges and solutions that might generally apply to building AI agents using large language models (LLMs). The challenge lies in how to harness the power of LLMs to create effective and safe AI agents that act as part of an application.

An AI agent is a computer program that takes user prompts and performs actions based on them. These actions can range from data analysis to internet research, and even tasks like making restaurant reservations or booking flights. The possibilities are endless, but the commonality is that the LLM gets user input and takes actions using common computer tools such as a code interpreter or APIs.


LLMs are incredibly powerful, and AutoGPT has shown what is possible with AI agents. However, while AutoGPT has blown everyone's mind it is also still suffering from many issues. Even simpler agents such as the LangChain Python Agent come with challenges when integrated into an app. Namely:

  • Security. Is this approach safe to deploy on your server without exposing you to hacks.
  • Controllability. Does it provide a predictable and controllable user experience.
  • Scalability. Can the approach handle a complex suite of tasks successfully and bug-free.

Here are two different approaches to building AI agents. Each one has a different trade-off in terms of the above challenges.

Approach 1: Restricting the tools

The first approach involves restricting the LLM to a predefined set of actions using rules or syntax. This has actually been mentioned by @yoheinakajima himself. We can provide the LLM different actions it can take via the prompt and how it can provide input parameters. We then parse the LLM output and extract what action should be taken with which parameters. For example, in data analysis, we can use a predefined markup like "MEAN X". By checking if the output contains "MEAN" and "X," we can compute the mean of the column X. This can also be implemented via LangChain tools.

Example of a LangChain tool that can calculate summary statistics for a dataframe df.
A LangChain tool that can calculate summary statistics for a dataframe df which is in memory.

The advantage of this approach is that it is safe, as we precisely define what code is executed. We can also clearly define the scope of the agent's capabilities, which helps predict user inputs and allows us to control the overall user experience. The downside with this approach is that it is not very scalable. For one, you have to manually code all the possible actions that can be taken. Secondly, dealing with multi step sequences of tasks where the output from one step is the input to the next step can be difficult. Given how AutoGPT has made the news, users have quite high expectations of what an AI tool can do, and from my experience this approach can result in something that underdelivers with respect to user expectations.

Approach 2: Letting the LLM code

The second approach is to let the LLM write code and execute it. This gives us incredible capabilities and it is quite easy to get started. All you need is to describe to the LLM that it should write code, implement logic to parse the output of the LLM, and execute the code generated by the LLM. The advantage of this approach is that it easily scales to very complex tasks and in practice even GPT3.5-turbo seems to generate bug-free code in 99% of cases.

However, this approach also comes with major risks and should put you into defense mode from the start. If you pipe the user's input straight to the LLM and execute the LLM generated code, you give the user the ability to execute any code on your server. Not only does that mean that they can use your server for whatever they want, but they can also print out any sensitive information from the server environment such as credentials and API keys.

Executing LLM generated code on your server.

The other challenge of this approach is that you don't have very much control over the code that is executed. Going back to the data analysis example, say you want all generated plots to have a light gray background. This is actually quite hard in this case since you don't have control over the code for the plot. You can try to include this as an instruction in the prompt, however it doesn't guarantee that all plots that come out have a light gray background and modifying the prompt may not work for complex scenarios at all.

Heavily branded plots will be difficult to do with LLM generated code.

So in summary, while letting the LLM write code is incredibly powerful, it comes with some major security concerns and lacks fine grained control.

Addressing security concerns with Approach 2

To make the execution of LLM-generated code safer, there are a few options we can consider. OpenAI's approach focuses on creating a safe sandbox environment. As explained there, we can execute the code in an environment without internet access and strict resource limits. Alternatively we could set up the app in a way where we never execute generated code on the server but only on the users device, for example using WebAssembly to execute in the browser or a native app to execute directly on the users computer. A third option is to vet the code for potential risks before executing it. This could be done using manually written rules or by using another LLM to detect unsafe code.

Conclusion and Future Work

In summary, building AI agents with LLMs is an exciting but challenging task. I've discussed two approaches, their advantages, and disadvantages, and offered some options to enhance the safety of executing LLM-generated code.

In the future I would also like to explore how to leverage fine-tuning to create AI agents and meet the challenges described here. In theory, a fine-tuned LLM should be able to reject requests that lie outside of its defined scope such as revealing sensitive information. It would also offer control over the generated code by means of the training data. However, with fine-tuning comes the challenge of collecting and managing a training dataset. And it remains to be seen how well fine-tuning works to guide the models behavior.

It's an ongoing journey to figure out how to build production-ready AI agents. If you have comments or questions, please reach out to me under @lars_hertel on Twitter.