Generative AI has the power to surprise in a way that few other technologies can. Sometimes that's a very good thing; other times, not so good. This means the question of expectations is right at the center of our experience of the technology: was that really what I expected? Is that output good enough? Or is it just fooling me?
In theory, as generative AI improves, this issue should become less important. However, in reality, as it becomes more ‘human’ it can begin to turn sinister and unsettling, plunging us into what robotics has long described as the uncanny valley.
It might be tempting to overlook this experience as something that can be corrected by bigger data sets or better training. However, insofar as it speaks to a disturbance in our mental model of the technology — I don’t like what it did there — it’s something that needs to be acknowledged and addressed if we’re going to actually leverage and live with AI effectively and safely in the years to come.
Mental models and antipatterns
Mental models are an important concept in UX and product design, but they need to be more readily embraced by the AI community. They often go unnoticed in everyday life precisely because they are routine: the tacit assumptions we bring to an AI system about how it behaves and what it can do.
This is something we discussed at length in the process of putting together the latest volume of the Technology Radar, a biannual report based on our experiences working with clients all over the world.
For instance, we called out complacency with AI-generated code and replacing pair programming with generative AI as two practices we believe practitioners must avoid as the popularity of AI coding assistants continues to grow. Both emerge from poor mental models that fail to acknowledge how this technology actually works and where its limitations lie. The consequence is that the more convincing and ‘human’ these tools become, the harder it is for us to recognize those limitations in the ‘solutions’ they provide us.
Of course, for those deploying generative AI into the world, the risks are similar, perhaps even more pronounced. While the intent behind such tools is usually to create something convincing and usable, if they mislead, trick, or even merely unsettle users, their value evaporates. It’s no surprise that we’re seeing legislation in this area, such as the EU AI Act, which requires the creators of deep fakes to label content as AI generated.
It’s worth pointing out that this isn’t just an issue for AI and robotics. Back in 2011, my colleague Martin Fowler wrote about how certain approaches to building cross platform mobile applications can create an uncanny valley “where things work mostly like… native controls but there are just enough tiny differences to throw users off.”
Without wishing to open up a discussion about mobile development, Fowler wrote something I think is instructive: “different platforms have different ways they expect you to use them that alter the entire experience design.” The point, applied to generative AI, is that different contexts and use cases come with different sets of assumptions and mental models that change the point at which users might drop into the uncanny valley. These subtle differences shape one’s experience or perception of an LLM’s output.
For the drug researcher who wants vast amounts of synthetic data, accuracy at a micro level may be unimportant; for the lawyer trying to grasp legal documentation, accuracy matters a lot. In fact, dropping into the uncanny valley might just be the signal to step back and reassess your expectations.
Shifting our perspective
The uncanny valley of generative AI might be troubling, even something we want to minimize, but it should also be a tool that reminds us of the technology’s limitations. It should encourage us to rethink our perspective.
There have been some interesting attempts to do that across the industry. One that stands out comes from Ethan Mollick — a professor at the University of Pennsylvania — who argues that AI shouldn’t be understood as good software but instead as “pretty good people.”
“What sort of work you should trust it with is tricky, because, like a human, the AI has idiosyncratic strengths and weaknesses,” Mollick writes. “Since there is no manual, the only way to learn what the AI is good at is to work with it until you learn.” In other words, our expectations about what generative AI can do and where it’s effective must remain provisional and should be flexible. To a certain extent, this might be one way of overcoming the uncanny valley — by reflecting on our assumptions and expectations, we remove the technology’s power to disturb or confound them.
Unpacking the black box
Simply calling for a mindset shift, though, isn’t enough. Yes, it’s a first step, but there are also practices and tools that can actually help us think differently about generative AI and address the challenges posed by our mental models.
One example is the technique — which we identified in the latest Technology Radar — of getting structured outputs from LLMs. This can be done either by instructing a model to respond in a particular format when prompting or through fine-tuning, and tools like Instructor are making it easier than it used to be. The benefit is greater alignment between our expectations and what the LLM will output — something unexpected or not quite right may still slip through, but this technique goes some way toward addressing that.
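To make the idea concrete, here is a minimal, stdlib-only sketch of output validation — the core of what structured-output tooling automates. The `ContractSummary` schema and the simulated response are hypothetical examples; libraries like Instructor derive the schema from a Pydantic model, attach it to the LLM request, and retry automatically when the response doesn’t validate.

```python
import json
from dataclasses import dataclass

# Hypothetical schema we want the model to follow. Structured-output
# tooling generates the format instructions and validation from a
# declaration like this; here we spell both out by hand.
@dataclass
class ContractSummary:
    parties: list
    effective_date: str
    risk_level: str

SCHEMA_PROMPT = (
    "Summarise the contract as JSON with keys 'parties' (list of strings), "
    "'effective_date' (ISO date) and 'risk_level' ('low', 'medium' or 'high'). "
    "Return JSON only."
)

def parse_summary(raw: str) -> ContractSummary:
    """Validate the model's raw text against our expectations."""
    data = json.loads(raw)               # raises ValueError on non-JSON output
    summary = ContractSummary(**data)    # raises TypeError on missing/extra keys
    if summary.risk_level not in {"low", "medium", "high"}:
        raise ValueError(f"unexpected risk_level: {summary.risk_level}")
    return summary

# Simulated LLM response; in practice this comes back from the chat API
# after SCHEMA_PROMPT has been included in the request.
raw_output = (
    '{"parties": ["Acme Corp", "Globex"], '
    '"effective_date": "2024-03-01", "risk_level": "medium"}'
)
summary = parse_summary(raw_output)
```

The point is that failures become explicit exceptions we can handle — or feed back to the model for a retry — rather than surprises that surface downstream.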
There are other techniques too. We’re particularly fond of retrieval-augmented generation as a way of taming the usually troublesome task of managing what’s called the ‘context window.’ What’s more, the space is evolving in such a way that we’re seeing frameworks and tools that can help us evaluate and measure the success of such techniques. Ragas is a useful library that provides AI developers with metrics around things like faithfulness and relevance; DeepEval, which also features on the Radar, is worth mentioning too.
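To illustrate the retrieval half of retrieval-augmented generation, the sketch below ranks documents by naive term overlap with the query and packs only the best matches into a bounded context. This is a toy under stated assumptions: production systems use embedding similarity and token-based budgets, and the snippets and character limit here are invented for the example.

```python
# Toy retrieval step for RAG: rank documents by word overlap with the
# query, then fill a bounded "context window" with the top matches.

def score(query: str, doc: str) -> int:
    """Count query words that also appear in the document (very naive)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_context(query: str, docs: list, max_chars: int = 500) -> str:
    """Greedily pack the highest-scoring documents into a size budget."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    picked, used = [], 0
    for doc in ranked:
        if used + len(doc) > max_chars:
            break
        picked.append(doc)
        used += len(doc)
    return "\n---\n".join(picked)

docs = [
    "The EU AI Act requires labelling of AI generated deep fakes.",
    "Retrieval augmented generation grounds answers in source documents.",
    "Pair programming practices predate AI coding assistants.",
]
prompt = (
    "Answer using only the context below.\n\n"
    + build_context("What does the EU AI Act require?", docs)
)
```

Because the model is asked to answer from the packed context rather than from its weights alone, the output becomes easier to check — which is exactly what libraries like Ragas measure with metrics such as faithfulness.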
Measurement is important, but so is thinking through the relevant guidelines and policies for LLMs — that’s why we encourage the industry to explore LLM Guardrails — and taking steps to better understand what’s actually happening inside these models. Completely unpacking these black boxes might be impossible, but thanks to tools like Langfuse, teams and organizations can gain a clearer view of how they operate. Doing so may go a long way in reorienting their relationship with this technology, shifting mental models and reducing the likelihood of falling into the uncanny valley.
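A guardrail, at its simplest, is a check that sits between the model and the user. The sketch below is a toy illustration of that pattern — the regex and blocked-message wording are my own, not taken from any particular guardrails framework — blocking responses that appear to leak credentials before they reach the user.

```python
import re

# Toy output guardrail: inspect a model response before returning it.
# The pattern and policy here are illustrative, not from a real framework.
SECRET_PATTERN = re.compile(r"(api[_-]?key|password)\s*[:=]", re.IGNORECASE)

def guard(response: str) -> str:
    """Return the response unchanged, or a safe refusal if it trips a rule."""
    if SECRET_PATTERN.search(response):
        return "[blocked: response appeared to contain credentials]"
    return response
```

Real guardrail frameworks chain many such checks — topical relevance, toxicity, jailbreak detection — over both inputs and outputs, but the underlying shape is the same: make expectations explicit and enforce them at the boundary.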
An opportunity, not a flaw
These tools — part of what we’ve described as a “Cambrian explosion of generative AI tools” — can help those at the heart of the industry rethink generative AI and, hopefully, build better and more responsible products. However, for the wider world, this work will remain invisible. What’s important, then, is that as well as exploring how we can evolve our toolchains to better control and understand generative AI, we also acknowledge that existing mental models and conceptions of generative AI are a fundamental design problem, not a marginal issue we can choose to ignore while we plow forward.
The uncanny valley of generative AI isn’t a problem to be fixed; it’s an opportunity for everyone to reassess what we really want and expect from this technology.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.