Artificial intelligence (AI) "agents" that have recently been rolled out can use computers by taking control of the keyboard and mouse. These AI agents can use a computer just like a person would and can receive instructions in English. "By end-2025, these AI agents are expected to become commonplace and to change how we work and use the internet," according to BIS Working Papers No 1245 – Putting AI agents through their paces on general tasks.
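In outline, the control loop behind such an agent is straightforward, even if the model driving it is not. Below is a minimal, hypothetical Python sketch of that perception-action loop; the stub functions stand in for real screenshot capture, a multimodal model call and OS-level keyboard and mouse input, and do not reflect any specific vendor's API.

```python
# A minimal sketch of a computer-using agent's perception-action loop.
# All helpers are hypothetical stubs, not a real agent framework.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click", "type" or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def capture_screen() -> bytes:
    return b""           # stub: a real agent grabs an actual screenshot

def ask_model(instruction: str, screenshot: bytes, history: list) -> Action:
    return Action("done")  # stub: a real agent queries a multimodal LLM

def execute(action: Action) -> None:
    pass                 # stub: a real agent moves the mouse or presses keys

def run_agent(instruction: str, max_steps: int = 50) -> None:
    history: list[Action] = []
    for _ in range(max_steps):
        action = ask_model(instruction, capture_screen(), history)
        if action.kind == "done":
            break        # the model judges the task complete
        execute(action)  # click or type as the model instructed
        history.append(action)

run_agent("Open the browser and play today's Wordle")
```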
The authors of the report assessed these AI agents on the general skills needed to play games such as Wordle – recognising when tiles change colour, using feedback to arrive at informed guesses and inputting those guesses using the computer keyboard and mouse.
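To make that feedback step concrete, here is a short, self-contained Python sketch (not the paper's benchmark code) of the core Wordle logic the agents must reason about: scoring a guess against the hidden answer and discarding candidate words that are inconsistent with the observed tile colours.

```python
# Illustrative Wordle feedback logic: score a guess against the answer,
# then keep only the candidates consistent with that feedback.

from collections import Counter

def score(guess: str, answer: str) -> str:
    """Return per-letter feedback: 'g' = green, 'y' = yellow, '.' = grey."""
    result = ["."] * len(guess)
    remaining = Counter(answer)
    # First pass: greens (right letter, right position).
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
            remaining[g] -= 1
    # Second pass: yellows (right letter, wrong position), respecting counts.
    for i, g in enumerate(guess):
        if result[i] == "." and remaining[g] > 0:
            result[i] = "y"
            remaining[g] -= 1
    return "".join(result)

def consistent(candidates, guess, feedback):
    """Keep only words that would have produced the observed feedback."""
    return [w for w in candidates if score(guess, w) == feedback]

words = ["crane", "slate", "trace", "grace", "place"]
fb = score("crane", "trace")               # -> "ygg.g"
print(fb, consistent(words, "crane", fb))  # -> ygg.g ['trace', 'grace']
```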
A key question about the role of AI agents is whether they will enhance human abilities by serving as "co-pilots" or be good enough to become autonomous tools that can displace human operators.
The authors said: "Our test is designed to shed light on this question. AI agents display impressive abilities in narrowly defined tasks such as standardised tests in mathematics, medicine and law as well as other tests that they can be trained for. However, more is required for general tasks in the real world. To handle real-world tasks, AI agents need to solve many related tasks, understand complex situations, deal with inconsistencies and, importantly, be able to self-assess, experiment and autocorrect. By asking AI agents to play games such as Wordle using a keyboard and mouse, we put them through their paces to see how they measure up to humans."
The authors found AI agents to be wanting along key dimensions. They are impressive at narrowly defined tasks. But they lack the self-awareness to know when they have gone wrong and change course in light of evidence, and they lack the ability to experiment in the optimal way to remedy their ignorance. They make too many mistakes in some repetitive tasks, which slows down progress towards solving the overall problem. Humans also make similar mistakes, but humans excel at recovering from them. In contrast, AI agents get stuck because they cannot experiment and learn from their errors.
While future generations of AI agents may overcome these shortcomings, in the near term, AI agents are more likely to be used as co-pilots that enhance the work of human operators rather than as autonomous agents that displace human workers.
Multimodal large language models (LLMs), trained on vast datasets, are becoming increasingly capable in many settings. However, the capabilities of such models are typically evaluated on narrow tasks, much like standard machine learning models trained for specific objectives. The authors of the report take a different tack by putting the latest LLM agents through their paces on general tasks involved in solving three popular games – Wordle, Face Quiz and Flashback. These games are easily tackled by humans, but they demand a degree of self-awareness and higher-level abilities to experiment, to learn from mistakes and to plan accordingly.
The authors find that the LLM agents display mixed performance on these general tasks: they lack the awareness to learn from mistakes and the capacity for self-correction. LLMs' performance on the most complex cognitive subtasks may not be the limiting factor for their deployment in real-world environments. Instead, it would be important to evaluate the capabilities of AGI-aspiring LLMs through general tests that encompass the multiple cognitive tasks required to solve complete, real-world applications.
BIS Working Papers No 1245 – Putting AI agents through their paces on general tasks
Banking 4.0 – "how was the experience for you"
"So many people are coming here to Bucharest, people I see and interact with on LinkedIn, and now I get the chance to meet them in person. It was like being at the Football World Cup, but this was the LinkedIn World Cup of payments and open banking."
Many more interesting quotes in the video below: