Techniques
Adopt
-
1. 1% canary
For many years, we've used the canary release approach to encourage early feedback on new software versions, while reducing risk through incremental rollout to selected users. The 1% canary is a useful technique where we roll out new features to a very small segment (say 1%) of users carefully chosen across various user categories. This enables teams to capture fast user feedback, observe the impact of new releases on performance and stability, and respond as necessary. This technique becomes especially crucial when teams are rolling out software updates to mobile applications or a fleet of devices like edge computing devices or software-defined vehicles. With proper observability and early feedback, it gives teams the opportunity to contain the blast radius in the event of unexpected scenarios in production. While canary releases can be useful to get faster user feedback, we believe starting with a small percentage of users is mandatory to reduce and contain the risk of large-scale feature rollouts.
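As a minimal sketch of how such a rollout gate might work (the feature name, hashing scheme and 1% threshold are illustrative assumptions, not any particular product's API), deterministic bucketing keeps each user consistently in or out of the canary across requests:

```python
import hashlib

def in_canary(user_id: str, feature: str = "new-checkout", rollout_percent: float = 1.0) -> bool:
    """Deterministically assign a user to the canary cohort (illustrative sketch)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000   # stable bucket in the range 0..9999
    return bucket < rollout_percent * 100   # 1% -> buckets 0..99

version = "canary" if in_canary("user-42") else "stable"
```

Because the bucket is derived from the user ID, the same users remain in the canary as the rollout percentage is gradually increased.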
-
2. Component testing
Automated testing remains a cornerstone of effective software development. For front-end tests we can argue whether the distribution of different test types should be the classic test pyramid or whether it should be a trophy shape. In either case, though, teams should focus on component testing, because test suites should be stable and run quickly. Instead, we're seeing teams forgo mastering component testing in favor of end-to-end browser-based testing as well as very narrowly defined unit tests. Unit tests have a tendency to force components to expose what should be purely internal functionality, while browser-based tests are slow, flakier and harder to debug. Our recommendation is to have a significant amount of component tests and to use a library like jsdom to run the component tests in memory. Browser tools like Playwright, of course, still have a place in end-to-end tests, but they shouldn't be used for component testing.
-
3. Continuous deployment
We believe organizations should adopt continuous deployment practices whenever possible. Continuous deployment is the practice of automatically deploying every change that passes automated tests to production. This practice is a key enabler of fast feedback loops and allows organizations to deliver value to customers more quickly and efficiently. Continuous delivery differs from continuous deployment in that it only requires that code can be deployed at any time; it doesn't require that every change actually is deployed to production. We've hesitated to move continuous deployment into the Adopt ring in the past, as it’s a practice that requires a high level of maturity in other areas of software delivery and is therefore not appropriate for all teams. However, Thoughtworker Valentina Servile’s recent book Continuous Deployment provides a comprehensive guide to implementing the practice in an organization. It offers a roadmap for organizations to follow in order to achieve the level of maturity required to adopt continuous deployment practices.
-
4. Retrieval-augmented generation (RAG)
Retrieval-augmented generation (RAG) is the preferred pattern for our teams to improve the quality of responses generated by a large language model (LLM). We’ve successfully used it in many projects, including the Jugalbandi AI platform. With RAG, information about relevant and trustworthy documents is stored in a database. For a given prompt, the database is queried, relevant documents are retrieved and the prompt is augmented with the content of the documents, thus providing richer context to the LLM. This results in higher-quality output and greatly reduced hallucinations. The context window — which determines the maximum size of the LLM input — has grown significantly with newer models, but selecting the most relevant documents is still a crucial step. Our experience indicates that a carefully constructed smaller context can yield better results than a broad and large context. Using a large context is also slower and more expensive. We used to rely solely on embeddings stored in a vector database to identify additional context. Now, we're seeing reranking and hybrid search: search tools such as Elasticsearch Relevance Engine, as well as approaches like GraphRAG that utilize knowledge graphs created with the help of an LLM. A graph-based approach has worked particularly well in our work on understanding legacy codebases with GenAI.
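The following sketch shows the core retrieve-then-augment flow. The documents, the all-MiniLM-L6-v2 embedding model and the prompt template are assumptions for illustration; any vector database could stand in for the in-memory similarity search:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Refunds are processed within 5 business days.",
    "Premium support is available 24/7 for enterprise plans.",
    "Passwords must be rotated every 90 days.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k most similar documents to the query."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector            # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# `prompt` is then sent to the LLM of your choice.
```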
Trial
-
5. Domain storytelling
Domain-driven design (DDD) has become a foundational approach to the way we develop software. We use it to model events, to guide software designs, to establish context boundaries around microservices and to elaborate nuanced business requirements. DDD establishes a ubiquitous language that both nontechnical stakeholders and software developers can use to communicate effectively about the business. Once established, domain models evolve, but many teams find it hard to get started with DDD. There’s no one-size-fits-all approach to building an initial domain model. One promising technique we've encountered recently is domain storytelling, a facilitation technique in which business experts are prompted to describe activities in the business. As the experts are guided through their narration, a facilitator uses a pictographic language to capture the relationships and actions between entities and actors. The process of making these stories visible helps to clarify and develop a shared understanding among participants. Since there is no single best approach to developing a domain model, domain storytelling offers a noteworthy alternative to Event Storming, another technique we often use when getting started with DDD, or a companion to it when a more comprehensive approach is warranted.
-
6. Fine-tuning embedding models
When building LLM applications based on retrieval-augmented generation (RAG), the quality of embeddings directly impacts both retrieval of the relevant documents and response quality. Fine-tuning embedding models can enhance the accuracy and relevance of embeddings for specific tasks or domains. Our teams fine-tuned embeddings when developing domain-specific LLM applications for which precise information extraction was crucial. However, consider the trade-offs of this approach before you rush to fine-tune your embedding model.
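A minimal sketch of such fine-tuning, assuming the sentence-transformers library and a handful of hypothetical domain query-passage pairs; a real effort would need a much larger, carefully curated training set plus an evaluation step before and after tuning:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Pairs of queries and passages that should end up close together in the embedding space.
train_examples = [
    InputExample(texts=["What is the claim deadline?", "Claims must be filed within 30 days."]),
    InputExample(texts=["How do I cancel a policy?", "Policies can be cancelled via the customer portal."]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed base embedding model
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)      # other in-batch passages act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("domain-tuned-embedder")
```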
-
7. Function calling with LLMs
Function calling with LLMs refers to the ability to integrate LLMs with external functions, APIs or tools by determining and invoking the appropriate function based on a given query and associated documentation. This extends the utility of LLMs beyond text generation, allowing them to perform specific tasks such as information retrieval, code execution and API interaction. By triggering external functions or APIs, LLMs can perform actions that were previously outside their standalone capabilities. This technique enables LLMs to act on their outputs, effectively bridging the gap between thought and action — similar to how humans use tools to accomplish various tasks. By introducing function calling, LLMs add determinism and factuality to the generation process, striking a balance between creativity and logic. This method allows LLMs to connect to internal systems and databases or even perform internet searches via connected browsers. Models like OpenAI's GPT series support function calling, and fine-tuned models like Gorilla are specifically designed to enhance the accuracy and consistency of generating executable API calls from natural language instructions. As a technique, function calling fits within retrieval-augmented generation (RAG) and agent architectures. It should be viewed as an abstract pattern of use, emphasizing its potential as a foundational tool in diverse implementations rather than a specific solution.
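A brief sketch of the pattern using the OpenAI Python SDK's tools parameter; the get_order_status function, its schema and the model name are hypothetical placeholders for whatever internal API you expose to the model:

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",          # hypothetical internal function
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",                     # assumed model name
    messages=[{"role": "user", "content": "Where is order 1234?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    if call.function.name == "get_order_status":
        args = json.loads(call.function.arguments)
        # Dispatch to the real implementation, then feed the result back to the
        # model as a "tool" message so it can compose the final answer.
        print(args["order_id"])
```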
-
8. LLM as a judge
Many systems we build share two key characteristics: they can provide answers to questions about a large data set, and it's next to impossible to follow how they arrived at those answers. Despite this opacity, we still want to assess and improve the quality of the responses. With the LLM as a judge pattern, we use an LLM to evaluate the responses of another system, which in turn might be based on an LLM. We've seen this pattern used to evaluate the relevance of search results in a product catalog and to assess whether an LLM-based chatbot was guiding its users in a sensible direction. Naturally, the evaluator system must be set up and calibrated carefully. It can drive significant efficiency gains, which, in turn, translate to lower costs. This is an ongoing area of research, with the current state summarized in this article.
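As a minimal sketch of the pattern, the code below has an LLM score the relevance of a search result; the prompt, rating scale and model name are assumptions, and a real evaluator would be calibrated against human-labeled examples before being trusted:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a product-search system.
Query: {query}
Returned result: {result}
Rate the relevance of the result to the query from 1 (irrelevant) to 5 (highly relevant).
Reply with the number only."""

def judge_relevance(query: str, result: str) -> int:
    """Ask the judge model for a relevance score."""
    response = client.chat.completions.create(
        model="gpt-4o",                      # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, result=result)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

score = judge_relevance("waterproof hiking boots", "Men's leather office shoes")
```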
-
9. Passkeys
Shepherded by the FIDO alliance and backed by Apple, Google and Microsoft, passkeys are nearing mainstream usability. Setting up a new login with passkeys generates a key pair: the website receives the public key and the user keeps the private key. Handling login uses asymmetric cryptography. The user proves they're in possession of the private key, which is stored on the user’s device and never sent to the website. Access to passkeys is protected using biometrics or a PIN. Passkeys can be stored and synced within the big tech ecosystems, using Apple's iCloud Keychain, Google Password Manager or Windows Hello. For multiplatform users, the Client to Authenticator Protocol (CTAP) makes it possible for passkeys to be kept on a device other than the one that creates the key or needs it for login. The most common objection to using passkeys claims that they are a challenge for less tech-savvy users, which is, we believe, self-defeating. These are often the same users who have poor password discipline and would therefore benefit the most from alternative methods. In practice, systems that use passkeys can fall back to more traditional authentication methods if required.
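The sketch below illustrates only the underlying asymmetric challenge-response idea using Python's cryptography library; it is not the actual WebAuthn/CTAP protocol, which adds attestation, origin binding, credential management and more:

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Registration: the key pair is generated on the user's device;
# only the public key is sent to the website.
private_key = ec.generate_private_key(ec.SECP256R1())
public_key = private_key.public_key()

# Login: the website issues a random challenge, the device signs it with the
# private key, and the website verifies the signature with the stored public key.
challenge = os.urandom(32)
signature = private_key.sign(challenge, ec.ECDSA(hashes.SHA256()))
public_key.verify(signature, challenge, ec.ECDSA(hashes.SHA256()))  # raises if invalid
print("login verified without any shared secret leaving the device")
```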
-
10. Small language models
Large language models (LLMs) have proven useful in many areas of application, but the fact that they are large can be a source of problems: responding to a prompt requires a lot of compute resources, making queries slow and expensive; the models are proprietary and so large that they must be hosted in a cloud by a third party, which can be problematic for sensitive data; and training a model is prohibitively expensive in most cases. The last issue can be addressed with the RAG pattern, which side-steps the need to train and fine-tune foundational models, but cost and privacy concerns often remain. In response, we’re now seeing growing interest in small language models (SLMs). In comparison to their more popular siblings, they have fewer weights and less precision, usually between 3.5 billion and 10 billion parameters. Recent research suggests that, in the right context and when set up correctly, SLMs can perform as well as or even outperform LLMs. And their size makes it possible to run them on edge devices. We've previously mentioned Google's Gemini Nano, but the landscape is evolving quickly, with Microsoft introducing its Phi-3 series, for example.
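A small sketch of running one of these models locally via the Hugging Face transformers pipeline; the Phi-3-mini checkpoint is one example of an SLM, and a reasonably capable machine (or a quantized variant of the model) is assumed:

```python
from transformers import pipeline

# Any small model from the Hugging Face hub could be substituted here.
generator = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")
result = generator("Summarize the warranty terms in one sentence:", max_new_tokens=80)
print(result[0]["generated_text"])
```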
-
11. Synthetic data for testing and training models
Synthetic data set creation involves generating artificial data that can mimic real-world scenarios without relying on sensitive or limited-access data sources. While synthetic data for structured data sets has been explored extensively (e.g., for performance testing or privacy-safe environments), we're seeing renewed use of synthetic data for unstructured data. Enterprises often struggle with a lack of labeled domain-specific data, especially for use in training or fine-tuning LLMs. Tools like Bonito and Microsoft's AgentInstruct can generate synthetic instruction-tuning data from raw sources such as text documents and code files. This helps accelerate model training while reducing costs and dependency on manual data curation. Another important use case is generating synthetic data to address imbalanced or sparse data, which is common in tasks like fraud detection or customer segmentation. Techniques such as SMOTE help balance data sets by artificially creating minority-class instances. Similarly, in industries like finance, generative adversarial networks (GANs) are used to simulate rare transactions, making models more robust at detecting edge cases and improving overall performance.
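A short sketch of the SMOTE balancing step mentioned above, using imbalanced-learn on a synthetic, fraud-like data set generated with scikit-learn purely for illustration:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# A deliberately imbalanced data set: roughly 1% "fraud" cases.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
print(Counter(y))                    # e.g. Counter({0: 9895, 1: 105})

# SMOTE synthesizes new minority-class samples by interpolating between neighbors.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))          # classes are now balanced with synthetic samples
```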
-
12. Using GenAI to understand legacy codebases
Generative AI (GenAI) and large language models (LLMs) can help developers write and understand code. Help with understanding code is especially useful in the case of legacy codebases with poor, out-of-date or misleading documentation. Since we last wrote about this, techniques and products for using GenAI to understand legacy codebases have further evolved, and we've successfully used some of them in practice, notably to assist reverse engineering efforts for mainframe modernization. A particularly promising technique we've used is a retrieval-augmented generation (RAG) approach where the information retrieval is done on a knowledge graph of the codebase. The knowledge graph can preserve structural information about the codebase beyond what an LLM could derive from the textual code alone. This is particularly helpful in legacy codebases that are less self-descriptive and cohesive. An additional opportunity to improve code understanding is that the graph can be further enriched with existing and AI-generated documentation, external dependencies, business domain knowledge or whatever else is available that can make the AI's job easier.
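A highly simplified sketch of the knowledge-graph idea, using networkx and hypothetical mainframe artifact names; a real implementation would populate the graph from parsers, enrich nodes with documentation and domain knowledge, and feed the retrieved context to the LLM alongside the source code:

```python
import networkx as nx

# Nodes are code elements; edges capture structural relationships that plain
# text retrieval would miss (batch schedules, includes, table access, ...).
graph = nx.DiGraph()
graph.add_edge("JOB_NIGHTLY_BILLING", "PGM_CALC_INTEREST", relation="executes")
graph.add_edge("PGM_CALC_INTEREST", "CPY_ACCOUNT_RECORD", relation="includes")
graph.add_edge("PGM_CALC_INTEREST", "TBL_ACCOUNTS", relation="reads")

def related_context(element: str, depth: int = 2) -> list[str]:
    """Collect nearby elements and their relationships to include in the LLM prompt."""
    nearby = nx.single_source_shortest_path_length(graph, element, cutoff=depth)
    return [
        f"{u} --{graph[u][v]['relation']}--> {v}"
        for u, v in graph.edges
        if u in nearby and v in nearby
    ]

structure = "\n".join(related_context("PGM_CALC_INTEREST"))
prompt = f"Explain what PGM_CALC_INTEREST does.\nKnown structure:\n{structure}\nSource code: ..."
```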
Assess
-
13. AI team assistants
AI coding assistance tools are mostly talked about in the context of assisting and enhancing an individual contributor's work. However, software delivery is and will remain teamwork, so you should be looking for ways to create AI team assistants that help create the 10x team, as opposed to a bunch of siloed AI-assisted 10x engineers. Fortunately, recent developments in the tools market are moving us closer to making this a reality. Unblocked is a platform that pulls together all of a team's knowledge sources and integrates them intelligently into team members' tools. And Atlassian's Rovo brings AI into the most widely used team collaboration platform, giving teams new types of search and access to their documentation, in addition to unlocking new ways of automation and software practice support with Rovo agents. While we wait for the market to further evolve in this space, we've been exploring the potential of AI for knowledge amplification and team practice support ourselves: we open-sourced our Haiven team assistant and started gathering learnings with AI assistance for noncoding tasks like requirements analysis.
-
14. Dynamic few-shot prompting
Dynamic few-shot prompting builds upon few-shot prompting by dynamically including specific examples in the prompt to guide the model's responses. Adjusting the number and relevance of these examples optimizes context length and relevancy, thereby improving model efficiency and performance. Libraries like scikit-llm implement this technique using nearest neighbor search to fetch the most relevant examples aligned with the user query. This technique lets you make better use of the model’s limited context window and reduce token consumption. The open-source SQL generator vanna leverages dynamic few-shot prompting to enhance response accuracy.
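A compact sketch of the idea, assuming a sentence-transformers embedding model and a hypothetical pool of labeled support tickets; libraries like scikit-llm wrap this nearest-neighbor selection step for you:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A pool of labeled examples; only the most relevant ones go into the prompt.
examples = [
    ("I never received my parcel", "shipping"),
    ("The card payment was declined", "billing"),
    ("The app crashes when I open settings", "technical"),
    ("I was charged twice this month", "billing"),
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
example_vectors = embedder.encode([text for text, _ in examples], normalize_embeddings=True)

def build_prompt(query: str, k: int = 2) -> str:
    """Pick the k examples most similar to the query and format them as few-shot demos."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(example_vectors @ query_vector)[::-1][:k]
    shots = "\n".join(f"Text: {examples[i][0]}\nLabel: {examples[i][1]}" for i in top)
    return f"Classify the text.\n{shots}\nText: {query}\nLabel:"

print(build_prompt("Why is there an extra charge on my invoice?"))
```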
-
15. GraphQL for data products
GraphQL for data products is the technique of using GraphQL as an output port through which clients consume a data product. We've talked about GraphQL as an API protocol and how it enables developers to create a unified API layer that abstracts away the underlying data complexity, providing a more cohesive and manageable interface for clients. GraphQL for data products makes it seamless for consumers to discover the data format and relationships through the GraphQL schema and to use familiar client tools. Our teams are exploring this technique in specific use cases like talk-to-data, where big data insights are explored and discovered with the help of large language models: the GraphQL queries are constructed by LLMs based on the user prompt, and the GraphQL schema is used in the LLM prompts for reference.
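A minimal sketch of exposing a data product through GraphQL, here using the Strawberry Python library; the CustomerOrder type and its resolver are hypothetical, and a real output port would read from the data product's storage layer rather than returning a hard-coded row:

```python
import strawberry

@strawberry.type
class CustomerOrder:
    order_id: strawberry.ID
    customer_id: str
    total: float

@strawberry.type
class Query:
    @strawberry.field
    def orders(self, customer_id: str) -> list[CustomerOrder]:
        # In a real data product this would query the product's storage layer.
        return [CustomerOrder(order_id=strawberry.ID("1"), customer_id=customer_id, total=42.0)]

schema = strawberry.Schema(query=Query)
result = schema.execute_sync('{ orders(customerId: "c-7") { orderId total } }')
print(result.data)
```

Consumers (or an LLM in a talk-to-data setup) can introspect this schema to discover the available fields and relationships before constructing queries.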
-
16. LLM-powered autonomous agents
LLM-powered autonomous agents are evolving beyond single agents and static multi-agent systems with the emergence of frameworks like Autogen and CrewAI. This technique allows developers to break down a complex activity into several smaller tasks performed by agents, each assigned a specific role. Developers can use preconfigured tools for performing the task, and the agents converse among themselves and orchestrate the flow. The technique is still in its early stages of development. In our experiments, our teams have encountered issues like agents going into continuous loops and exhibiting uncontrolled behavior. Libraries like LangGraph offer greater control over agent interactions, with the ability to define the flow as a graph. If you use this technique, we suggest implementing fail-safe mechanisms, including timeouts and human oversight.
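The fail-safes we recommend can be as simple as the guard rails in this framework-agnostic sketch; the agents dictionary and state shape are hypothetical, and frameworks like LangGraph provide more structured ways to express the same controls:

```python
import time

MAX_STEPS = 10          # guard against agents looping forever
TIMEOUT_SECONDS = 120   # overall time budget for the multi-agent run

def run_agents(task: str, agents: dict) -> str:
    """Minimal orchestration loop with fail-safes; `agents` maps role -> callable."""
    started = time.monotonic()
    state = {"task": task, "done": False, "result": ""}
    for _ in range(MAX_STEPS):
        if time.monotonic() - started > TIMEOUT_SECONDS:
            raise TimeoutError("agent run exceeded its time budget")
        for role, agent in agents.items():
            state = agent(state)             # each agent performs its assigned role
            if state["done"]:
                return state["result"]
    # Escalate to a human instead of looping indefinitely.
    raise RuntimeError("no agent reached a conclusion; human review required")
```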
-
17. Observability 2.0
Observability 2.0 represents a shift from traditional, disparate monitoring tools to a unified approach that leverages structured, high-cardinality event data in a single data store. This model captures rich, raw events with detailed metadata to provide a single source of truth for comprehensive analysis. By storing events in their raw form, it simplifies correlation, supports real-time and forensic analysis and enables deeper insights into complex, distributed systems. This approach allows for high-resolution monitoring and dynamic investigation capabilities. Observability 2.0 prioritizes capturing high-cardinality and high-dimensional data, allowing detailed examination without performance bottlenecks. The unified data store reduces complexity, offering a coherent view of system behavior and aligning observability practices more closely with the software development lifecycle.
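As an illustration of the wide-event style this approach favors, the sketch below emits one structured, high-cardinality event per unit of work; the field names and stdout destination are assumptions, with a real system shipping such events to a unified event store:

```python
import json
import sys
import time
import uuid

def emit_event(**fields) -> None:
    """Emit one wide, structured event per unit of work as a single JSON line."""
    event = {"timestamp": time.time(), "event_id": str(uuid.uuid4()), **fields}
    sys.stdout.write(json.dumps(event) + "\n")

emit_event(
    service="checkout",
    operation="place_order",
    duration_ms=187,
    user_id="u-82731",              # high-cardinality fields are kept, not aggregated away
    cart_size=4,
    payment_provider="stripe",
    feature_flags=["new-pricing"],
    error=None,
)
```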
-
18. On-device LLM inference
Large language models (LLMs) can now run in web browsers and on edge devices like smartphones and laptops, enabling on-device AI applications. This allows for secure handling of sensitive data without cloud transfer, extremely low latency for tasks like edge computing and real-time image or video processing, reduced costs by performing computations locally, and functionality even when internet connectivity is unreliable or unavailable. This is an active area of research and development. Previously, we highlighted MLX, an open-source framework for efficient machine learning on Apple silicon. Other emerging tools include Transformers.js and Chatty. Transformers.js lets you run transformers in the browser using ONNX Runtime, supporting models converted from PyTorch, TensorFlow and JAX. Chatty leverages WebGPU to run LLMs natively and privately in the browser, offering a feature-rich in-browser AI experience.
-
19. Structured output from LLMs
Structured output from LLMs refers to the practice of constraining a language model's response to a defined schema. This can be achieved either by instructing a generalized model to respond in a particular format or by fine-tuning a model so it "natively" outputs, for example, JSON. OpenAI now supports structured output, allowing developers to supply a JSON Schema, pydantic or Zod object to constrain model responses. This capability is particularly valuable for enabling function calling, API interactions and external integrations, where accuracy and adherence to a format are critical. Structured output not only enhances the way LLMs can interface with code but also supports broader use cases like generating markup for rendering charts. Additionally, structured output has been shown to reduce the chance of hallucinations within model output.
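A brief sketch using a pydantic model with the OpenAI Python SDK's structured-output helper; the Invoice schema, example text and model name are illustrative assumptions:

```python
from pydantic import BaseModel
from openai import OpenAI

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

client = OpenAI()
completion = client.beta.chat.completions.parse(   # structured-output helper in the openai SDK
    model="gpt-4o-mini",                           # assumed model name
    messages=[{"role": "user", "content": "Extract the invoice details: ACME Ltd, 120.50 EUR"}],
    response_format=Invoice,
)
invoice = completion.choices[0].message.parsed     # an Invoice instance, validated against the schema
print(invoice.total, invoice.currency)
```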
Hold
-
20. Complacency with AI-generated code
AI coding assistants like GitHub Copilot and Tabnine have become very popular. According to Stack Overflow's 2024 Developer Survey, "72% of all respondents are favorable or very favorable of AI tools for development". While we also see their benefits, we're wary about the medium- to long-term impact this will have on code quality and caution developers about complacency with AI-generated code. It’s all too tempting to be less vigilant when reviewing AI suggestions after a few positive experiences with an assistant. Studies like this one by GitClear show a trend of faster-growing codebases, which we suspect coincides with larger pull requests. And this study by GitHub has us wondering whether the mentioned 15% increase in the pull request merge rate is actually a good thing or whether people are merging larger pull requests faster because they trust the AI results too much. We're still using the basic "getting started" advice we gave over a year ago, which is to beware of automation bias, sunk cost fallacy, anchoring bias and review fatigue. We also recommend that programmers develop a good mental framework for where and when not to use and trust AI.
-
21. Enterprise-wide integration test environments
Creating enterprise-wide integration test environments is a common, wasteful practice that slows everything down. These environments invariably become a precious resource that's hard to replicate and a bottleneck to development. They also provide a false sense of security due to inevitable discrepancies in data and configuration overhead between environments. Ironically, a common objection to the alternatives — either ephemeral environments or multiple on-prem test environments — is cost. However, this fails to take into account the cost of the delays caused by enterprise-wide integration test environments as development teams wait for other teams to finish or for new versions of dependent systems to be deployed. Instead, teams should use ephemeral environments and, preferably, a suite of tests owned by the development team that can be spun up and discarded cheaply, using fake stubs for their systems rather than actual replicas. For other techniques that support this alternative, take a look at contract testing, decoupling deployment from release, focus on mean time to recovery and testing in production.
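As a small illustration of the fakes-rather-than-replicas point, the sketch below tests a hypothetical checkout service against a hand-rolled fake payments gateway instead of a shared integration environment; both classes are invented for the example:

```python
class FakePaymentsGateway:
    """Stands in for a downstream payments system during team-owned tests."""
    def __init__(self):
        self.charges = []

    def charge(self, customer_id: str, amount_cents: int) -> str:
        self.charges.append((customer_id, amount_cents))
        return "payment-accepted"

class CheckoutService:
    def __init__(self, payments):
        self.payments = payments

    def place_order(self, customer_id: str, amount_cents: int) -> bool:
        return self.payments.charge(customer_id, amount_cents) == "payment-accepted"

def test_order_is_placed_when_payment_succeeds():
    service = CheckoutService(FakePaymentsGateway())
    assert service.place_order("c-1", 4999)
```

Because the fake is cheap to create and discard, such tests run on every commit without waiting on any shared environment.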
-
22. LLM bans
Rather than instituting blanket LLM bans in the workplace, organizations should focus on providing access to an approved set of AI tools. A ban only pushes employees to find unapproved and potentially unsafe workarounds, creating unnecessary risks. Much like the early days of personal computing, people will use whatever tools they feel are effective to get their work done, regardless of the barriers in place. By not providing a safe and endorsed alternative, companies risk employees using unapproved LLMs, which come with intellectual property, data leakage and liability risks. Instead, offering secure, enterprise-approved LLMs or AI tools ensures both safety and productivity. A well-governed approach allows organizations to manage data privacy, security, compliance and cost concerns while still empowering employees with the capabilities that LLMs offer. In the best case, well-managed access to AI tools can accelerate organizational learning around the best ways to use AI in the workplace.
-
23. Replacing pair programming with AI
When people talk about coding assistants, the topic of pair programming inevitably comes up. Our profession has a love-hate relationship with it: some swear by it, others can't stand it. Coding assistants now raise the question: can a human pair with the AI instead of another human and get the same results for the team? GitHub Copilot even calls itself "your AI pair programmer." While we do think a coding assistant can bring some of the benefits of pair programming, we advise against fully replacing pair programming with AI. Framing coding assistants as pair programmers ignores one of the key benefits of pairing: to make the team, not just the individual contributors, better. Coding assistants can offer benefits for getting unstuck, learning about a new technology, onboarding or making tactical work faster so that we can focus on the strategic design. But they don't help with any of the team collaboration benefits, like keeping the work-in-progress low, reducing handoffs and relearning, making continuous integration possible or improving collective code ownership.