Part two: Engineering
In the first part of this series, Alessandro Salomone described why retailers turn to running servers in their stores. Local servers provide a platform for richer digital experiences for customers, support more sophisticated back-office processes around data, pricing and payment and, above all, enable resiliency in the case of network failure.
In this follow-up piece, we dive into the engineering you need to do it well. For example, we discuss challenges like testing and deployment and find out why containerization makes this all easier and more robust.
Data synchronization
Chris Ford: Doesn’t a local server introduce problems of data synchronization with the central server? How do you handle that?
Alessandro Salomone: Yes and no. Having a local server helps reduce the volume of data exchanged with the central servers, since not every device has to communicate with those servers directly. It also makes operations faster and more reliable, because the devices benefit from data cached on the local server, over a local network that is more stable and faster than the internet.
The presence of a local server does increase the complexity of the architecture, which has to be designed to ensure eventual synchronization of the information held in store and on the central server. Central server unavailability or unreachability, due to heavy loads or connection issues, can leave the two servers temporarily out of sync.
To handle this, we designed the servers to communicate using event queues in both directions, allowing the local server to pull events from the central server and vice versa. An event can be anything: the broadcast of a product update, the availability of a new promotion campaign, a device reporting a failure or a customer checking out. Queues allow the servers to catch up on updates missed during a temporary disconnection by replaying all the missed events.
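As a minimal sketch of the replay idea, consider an append-only event log where each consumer tracks the sequence number of the last event it applied. The names and structure here are illustrative rather than the production design; the point is that catch-up after a disconnection falls naturally out of keeping per-consumer offsets:

```python
import itertools

class Event:
    def __init__(self, seq, topic, payload):
        self.seq = seq          # monotonically increasing sequence number
        self.topic = topic      # e.g. "product.updated"
        self.payload = payload

class EventQueue:
    """Append-only event log that consumers read by offset."""
    def __init__(self):
        self._seq = itertools.count(1)
        self._log = []

    def publish(self, topic, payload):
        event = Event(next(self._seq), topic, payload)
        self._log.append(event)
        return event

    def pull_since(self, last_seen_seq):
        # Replaying everything after the consumer's last acknowledged
        # event is what lets a reconnecting store catch up.
        return [e for e in self._log if e.seq > last_seen_seq]

class StoreConsumer:
    """A store-side consumer that remembers its offset across outages."""
    def __init__(self, queue):
        self.queue = queue
        self.last_seen = 0      # persisted to local disk in a real system

    def sync(self):
        for event in self.queue.pull_since(self.last_seen):
            print(f"applying {event.topic} #{event.seq}")
            self.last_seen = event.seq   # advance only after applying

# The store can be offline while events accumulate centrally...
central = EventQueue()
store = StoreConsumer(central)
central.publish("product.updated", {"sku": "A1", "price": 9.99})
central.publish("promotion.published", {"campaign": "spring"})
store.sync()   # ...and replays both missed events, in order, on reconnect
```

Because the consumer only advances its offset after applying an event, a crash or dropped connection at any point simply means the next sync replays from the last acknowledged event.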
Another challenge worth mentioning is sizing on-premise infrastructure. Load is highly unpredictable in stores. The amount of footfall and therefore digital traffic varies a lot between stores and changes at different times of the year. It’s a good idea to make in-store infrastructure as horizontally scalable as possible so that you can add more capacity where you need it. You don’t want to rely on single large machines that will take the whole store down if they are overloaded, while servers in other stores are underutilized.
Testing and deployment
CF: How do you deploy updates to these local servers?
AS: The local servers ran their logic in containerized microservices. This allowed the development teams to publish new versions of the microservices to a central container registry and rely on the local servers to discover and pull the new container images.
The container orchestrator on each local server checked for new images on a frequent schedule, ensuring that new features and bug fixes could be rapidly deployed to all stores.
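The discovery loop itself is simple in essence. Here is a hedged sketch in Python with the registry calls stubbed out as hypothetical functions; real orchestrators do this natively, but comparing image digests on a schedule is the core of it:

```python
import time

POLL_INTERVAL_SECONDS = 300   # check the registry every five minutes

def registry_digest(image: str) -> str:
    """Hypothetical: ask the central registry for the current digest."""
    ...

def local_digest(image: str) -> str:
    """Hypothetical: digest of the image currently running in store."""
    ...

def pull_and_restart(image: str) -> None:
    """Hypothetical: pull the new image and roll the service over."""
    ...

def watch(images: list[str]) -> None:
    while True:
        for image in images:
            try:
                if registry_digest(image) != local_digest(image):
                    pull_and_restart(image)
            except OSError:
                # The registry may be unreachable from a store;
                # just try again on the next cycle.
                pass
        time.sleep(POLL_INTERVAL_SECONDS)
```

Pulling on a schedule, rather than pushing from the center, means a store that was offline simply picks up the latest images the next time it can reach the registry.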
When you design your container orchestration, be sure to consider the resource constraints of your in-store hardware. Kubernetes and its ecosystem of tools like Argo CD are powerful and might do what you need, but they are also primarily designed for data centers. You have to find a balance between achieving a lightweight solution and avoiding the temptation to roll your own infrastructure orchestration.
CF: How would you test such a setup?
AS: Containerization makes testing this setup a lot easier than if we were installing applications directly onto local devices. Testing can be done by replicating the containers and their connectivity on a virtual machine and running manual or automated tests, for example in a continuous integration pipeline. In our case we needed an extra step, which was to virtualize the in-store devices that the microservices solution talks to.
We designed the microservices architecture so that every in-store device had a corresponding containerized microservice abstracting the device's data and control interfaces. With a test double for each device microservice, the original microservice can be swapped out in a test setup. The test containers virtualizing the devices feature a control port through which device data and behavior can be simulated, making it possible to implement all the required test scenarios and, in fact, to drive development.
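Here is a hedged sketch of what such a test double could look like, using a hypothetical barcode scanner as the device and Flask for the HTTP interfaces. The endpoints and names are invented for illustration; the pattern is serving the same data interface as the real service, plus an extra control endpoint that only the test double exposes:

```python
# Hypothetical test double for a barcode-scanner device microservice.
from collections import deque
from flask import Flask, jsonify, request

app = Flask(__name__)
pending_scans = deque()   # scans "produced" by the simulated device

# --- Same data interface as the real device microservice --------------
@app.get("/scan")
def next_scan():
    if not pending_scans:
        return jsonify({"status": "idle"})
    return jsonify({"status": "scanned", "barcode": pending_scans.popleft()})

# --- Control interface, present only on the test double ---------------
@app.post("/control/queue-scan")
def queue_scan():
    body = request.get_json()
    pending_scans.append(body["barcode"])   # script the device's behavior
    return jsonify({"queued": len(pending_scans)})

if __name__ == "__main__":
    app.run(port=8080)
```

A CI pipeline can then POST a barcode to the control endpoint to simulate a customer scanning an item and assert that the downstream checkout flow reacts as expected, with no physical hardware in the loop.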
CF: When deploying to a cluster of servers in a data center, you might test the release with a small group of servers. Is there an equivalent for store servers?
AS: Definitely. We built a thin layer on top of our container orchestrator to implement pilot and canary release management. A specific store, or a selected set of stores, could be designated as a deployment group for "beta" releases of new software. This happened regularly and automatically, so that the features introduced by a new software version were exercised in a controlled environment, and end-to-end feedback from customers and store attendants was collected, before the version was released globally.
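Conceptually, that thin layer boils down to resolving a release channel per store. A toy sketch, with all identifiers invented for illustration:

```python
# Stores in the pilot deployment group track "beta" images;
# everyone else tracks "stable".
PILOT_STORES = {"store-0042", "store-0107"}   # the controlled environment

CHANNEL_TAGS = {
    "beta": "registry.example.com/pos-service:1.8.0-beta.2",
    "stable": "registry.example.com/pos-service:1.7.3",
}

def image_for_store(store_id: str) -> str:
    channel = "beta" if store_id in PILOT_STORES else "stable"
    return CHANNEL_TAGS[channel]

assert image_for_store("store-0042").endswith("beta.2")   # pilot store
assert image_for_store("store-9999").endswith("1.7.3")    # everyone else
```

Promoting a release globally then becomes a matter of pointing the stable channel at the tag the pilot stores have already validated.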
Unpredictable environments
CF: A data center is a very controlled environment. A store is less so. What does support look like when you have servers running out there in the world?
AS: To start with, a VPN for the store LAN is a must for security. Incoming connections should be blocked, and outgoing connections allowed only on the basis of an allow-list. This helps protect your in-store systems from external threats.
Proper, dedicated support channels should be available to store attendants and managers so they can quickly report issues and get them solved. A first level of support ensures that any store issue is promptly recorded and redirected to the correct department and, eventually, development team. This requires the development teams to build support guides that help first-level support address the easiest issues directly and, for more complex or unknown issues, tell them what information about the user's experience and actions to collect and pass on to the team to speed up its investigation and fix.
A ticketing system would record the user's issue and automatically reach the development team best placed to solve it, ringing the phone and emailing the team member on the roster: within a few minutes, the developer could be in contact with the store personnel.
This is where observability becomes essential: support is part of the product development process, and it has to be thought through from the very beginning. Building and running a system that can collect health metrics on the microservices running on the local server, and possibly from the connected devices, is an effective way to be able to observe what is happening in store and quickly spot issues. Also business metrics on the user flows are essential to understand where users (be it customers or attendants) are getting stuck or are experiencing problems.
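As a minimal sketch of what store-side metric collection could look like (the collector URL and field names are assumptions for illustration), the key design point is local buffering, so telemetry never blocks store operations when the central collector is unreachable:

```python
import json
import time
import urllib.request
from collections import deque

CENTRAL_COLLECTOR = "https://telemetry.example.com/ingest"   # hypothetical
buffer = deque(maxlen=10_000)   # drop oldest if central is down for long

def record(metric: str, value: float, **labels):
    buffer.append({
        "metric": metric,      # e.g. "checkout.completed"
        "value": value,
        "labels": labels,      # e.g. store_id, device_id, flow_step
        "ts": time.time(),     # timestamps let support correlate
    })                         # telemetry with in-store actions

def flush():
    while buffer:
        event = buffer[0]
        req = urllib.request.Request(
            CENTRAL_COLLECTOR,
            data=json.dumps(event).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=2)
            buffer.popleft()   # only discard after a successful send
        except OSError:
            break              # central unreachable; retry on next flush
```

Labeling each metric with the store, device and flow step is what later lets support line up telemetry with what the person in the store says they were doing.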
With appropriate observability tools, like logs, audits and dashboards, the development team can correlate the actions of the people in store with the timestamped information coming from the in-store telemetry, gaining the visibility to spot the reported issue. It is important to collect and present the data in a way that easily tells the story of what happened.
In urgent or complex cases, it can be more practical for the development team to call the store back directly and ask them to re-enact the situation in order to reproduce the error. Thanks to near-real-time telemetry, the developers can see what is happening in the store and in the software running on the local server, and promptly release an emergency fix that is distributed to all the stores within the hour.
High-traffic periods
CF: I recently spoke to our colleague Glauco about building retail systems to survive Black Friday and Cyber Week. How does this architecture stand up to demand in those busy periods?
AS: A nice thing about this architecture is that store systems are designed to run independently, so load from Cyber Week traffic elsewhere isn't really a problem. It's still very important that everything stays up, though!
It is in these periods of the year that the advantages of this distributed architecture are especially visible. One of our customers stated that more than 50% of their yearly revenue comes from Christmas shopping: a period in which the stores need to be fully operational at maximum capacity, without disruptions.
And of course, information like price changes and promotion campaigns for these weeks needs to be readily available before customers flood into the stores. That's where we got the idea of delivering such data to the stores up to a week before the new schemes take effect. All the information, accompanied by a proper start and end date, is readily available and usable by the store server and all the other in-store devices, and becomes active only in the expected time period, even if the connection to the central service is a bottleneck.
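A small sketch of such time-windowed activation, with invented campaign data: because promotions carry explicit start and end dates, the store server can receive them early and apply them locally at the moment of sale, with no central round-trip needed:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Promotion:
    campaign: str
    discount_pct: float
    starts_at: datetime   # delivered days before this, inactive until then
    ends_at: datetime

    def is_active(self, now: datetime) -> bool:
        return self.starts_at <= now < self.ends_at

# Delivered a week early, but only applied during Cyber Week itself.
cyber_week = Promotion("cyber-week", 20.0,
                       datetime(2024, 11, 25), datetime(2024, 12, 2))

def effective_price(base_price: float, promos: list, now: datetime) -> float:
    # Apply the best of whichever delivered promotions are currently live.
    pct = max((p.discount_pct for p in promos if p.is_active(now)), default=0.0)
    return round(base_price * (1 - pct / 100), 2)

assert effective_price(100.0, [cyber_week], datetime(2024, 11, 20)) == 100.0
assert effective_price(100.0, [cyber_week], datetime(2024, 11, 29)) == 80.0
```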
In the end, the central server acts as a business orchestrator, distributing information and defining the expected behavior, and as a collector for business observability, feeding both support and business intelligence insights. All this while the heavy load is handled seamlessly at the local level.
Thank you, Alessandro, for your insights on running business-critical services via in-store deployments.
These kinds of architectures support next-generation in-store experiences, but they also give rise to distributed systems challenges not found in traditional data center-based systems.
At Thoughtworks we’ve observed local deployments supporting rich in-store digital functionality as part of a movement back to on-premise, not as an alternative to cloud computing, but as a complement to it. We predict that as in-person and online digital experiences become ever more closely integrated, this trend will continue.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.