Brief summary
Technical debt is a ubiquitous problem in software engineering, yet its causes — and the potential ways to address it — are often context-specific, dependent on the challenges and goals of an organization.
In this episode of the Technology Podcast, Tim Cochran and Ajey Gore join Rebecca Parsons to discuss technical debt in startups and scale-ups. Taking in the causes of technical debt in both types of organizations, the various ways it can manifest itself and approaches and practices for tackling it, the episode dives deep into Tim and Ajey’s experiences leading technology and engineering teams around the world.
Transcript
Rebecca: Hello, everyone. My name is Rebecca Parsons. I am one of your co-hosts for the Thoughtworks Technology Podcast. I would like to welcome today first, Tim, who's a Thoughtworks colleague.
Tim Cochran: Hi, my name is Tim. I'm a technical director, and I'm also working as the head of the scale-up initiative in North America as well.
Rebecca: We are also joined by a former Thoughtworks colleague. Ajey and I worked together quite a bit in the UK. Ajey Gore, would you tell us a little bit about yourself?
Ajey Gore: Hi, thanks for hosting me. It's amazing to come back to what feels like our own company, which I joined in 2001. I worked with Thoughtworks for almost 10 years, and then went on a different journey. Currently, I work as an operating partner for technology at Sequoia India and Southeast Asia. Before that, I was the group CTO for Gojek, running out of Indonesia and four more countries, with 16 products.
We had an amazing journey over there. At that point in time, I also ended up working with Thoughtworks because one of the projects really needed help. I've been in this industry for 24, 25 years. I've been working closely with startups now. Right now, my passion is mostly helping build startups for the company.
Rebecca: Excellent. Our topic today is technical debt. We want to focus on technical debt in the context of startups and scale-ups because obviously, technical debt arises in different ways. There are some kinds of technical debt that are much more relevant, say, in an enterprise context. I think we're all familiar with the concept of technical debt. Ajey, how do you think about what technical debt is, where it comes from, and when it becomes a problem?
Ajey: If you look at technical debt, it mostly comes from two kinds of things in the early stages of startups. One is time pressure: we have to ship on time, we have a contract to honor, and we know the shortcut is problematic. Second, in the early days of a startup, technical debt comes from what I call innocence. The reason I call it innocence is that you don't know what you're doing yet; you're trying to work through things.
I'll use a simple example from a previous project, where we were trying to find the nearest driver to the customer. It was the first time for me doing it as well, although there were people who had been trying to do it for a while. The easiest approach is to put the customer in the center, draw a circle of a given radius around them, and find the drivers inside it. You can find a driver who is 10 meters away from you, but on the other side of the road. He's on a one-way street, and he's five kilometers away from you by route.
On the other side, there is another driver who's 20 meters away from you, but he's coming towards you and will reach you within a minute, and we did not know that. Technically, the way we started, we put simple geometry over it and started finding drivers, but that was technical debt, and we had to rewrite the whole thing to do proper routing.
The second iteration, where we did not know, was to get a bunch of big database servers, store the [unintelligible 00:03:49], store the routes, all that stuff. Eventually, we went to Google’s S2 library and started working with that. Across these three stages, we kept repaying our technical debt every few months as we iterated over the problem. This was a problem of innocence; it was not a problem of recklessness, it was not a problem of not going.
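[Illustration, not part of the recording] The gap Ajey describes between the first iteration and proper routing can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the coordinates, the `Driver` type and the `route_m` values are hypothetical, and `route_m` stands in for whatever a real routing engine would return. It is not Gojek's implementation and it does not use the S2 library.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


@dataclass
class Driver:
    name: str
    lat: float
    lon: float
    route_m: float  # hypothetical: road distance to the customer, as a routing engine might report it


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Straight-line (great-circle) distance between two points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_by_radius(customer: tuple[float, float], drivers: list[Driver]) -> Driver:
    """First iteration: pick the driver with the smallest straight-line distance."""
    return min(drivers, key=lambda d: haversine_m(customer[0], customer[1], d.lat, d.lon))


def nearest_by_route(drivers: list[Driver]) -> Driver:
    """Later iteration: pick the driver with the shortest road distance."""
    return min(drivers, key=lambda d: d.route_m)


# Hypothetical data matching the story: A is about 10 m away but 5 km by road,
# B is about 20 m away and already driving towards the customer.
customer = (-6.20000, 106.81660)
driver_a = Driver("A", -6.20009, 106.81660, route_m=5_000)
driver_b = Driver("B", -6.19982, 106.81660, route_m=20)
drivers = [driver_a, driver_b]

print(nearest_by_radius(customer, drivers).name)
print(nearest_by_route(drivers).name)
```

The two functions disagree on the same data, which is the point of the story: the geometry is correct, but it answers a different question from the one the customer cares about.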
Then there is something Martin [Fowler] talks about in one of his posts from long back: software builds up cruft. Cruft causes changes to take more effort. The technical debt metaphor treats the cruft as a debt. That is a second way of looking at it, right? Most of the time in startups, innocent technical debt rarely happens, because a lot of the people building them already have this kind of experience. But time pressure, or recklessness, or whatever you call this self-inflicted pain, where I know it's going to cause a problem and I go and do it anyway: that is the technical debt which is more frequent. You'll see that occur way more than anything else. That's how I see technical debt build up, and it builds up very fast in the early stages.
Rebecca: I guess one question I would have is this: if you really don't know your product-market fit, is it reckless to just see what you can get out? You used that term reckless a couple of times. I remember working with a client who was very much in a research experimentation phase. He completely rejected all of our unit testing and all of that.
He said, I know, I'm at least two instances away from something that I think is really going to work for the long term. To him, because he was willing to throw it away and I'd already seen him throw it away twice, I knew he was actually capable of saying, "Okay, I'm going to start over from scratch." Is it really reckless if you don't know yet what you're really trying to build?
Ajey: If you don't know what you're trying to build, then it's innocence. I use the word innocence as well. If you don't know what you are breaking, this is innocence and that's perfectly fine, and that happens a lot in the early days as well, as long as you are okay with not falling into the trap of the sunk cost fallacy. One of the biggest problems that happens with innocence is that we developers, I have been guilty of that, we write our code and treat it like my precious kind of thing, like very negative connotation over there.
The sunk cost fallacy is so prevalent, so upfront in our world, but we should be able to redo it again and again. As long as you're okay with that, then it's more of innocence and you're trying to ideate, implement, and iterate, which is perfectly fine. Actually, I always tell people that the only way you can go faster in a startup is ideate, implement, iterate, and then code and deploy after that. That's how you should do it all the time. If you do that, innocence plays a bigger part, as long as people understand.
I always say one more thing: every decision, at the time you make it, is the right decision. Only time tells you whether it's right or wrong, because time gives you more exposure, more knowledge, more expertise. Time will tell you if it is right or wrong, but at that point in time it was right. A lot of the time, people end up blaming each other in the ecosystem, saying you did that wrong, but it's not somebody's fault. It's just how much they knew. Innocence, I will say, plays a much bigger role, and people should realize that it was our innocence. That's all.
Rebecca: Tim, I expect that as startups transition into that scale-up phase, they're probably going to have a very different perspective on technical debt. How does that fit into what Ajey was talking about, from your perspective?
Tim: I agree with what Ajey’s saying. The way I think about it is that the startup is leveraging that debt. I think what we were saying is that the problem is when you don't know whether you've actually generated that debt. I think that's the problem sometimes: the company doesn't know. In North America, we have a portfolio of about 20 different companies, and when we ask what their problem is and why they think their growth is going to be bottlenecked, the number one reason is around technical investment and technical debt.
Often at that point, it's a bit too late because the technical debt has grown so much that they're feeling it. They can feel the effects of delivery slowing down, of resilience dropping, customer experience dropping, and that kind of thing. The smarter ones, and this is probably what Ajey is alluding to, are the ones that are able to spot it a bit earlier, know that it's there, and deal with it with rewrites and rebuilds, not of the whole system but of pieces, so you capture it before it affects business growth. That's the ideal situation. In most situations, obviously, the nature of people working with Thoughtworks is that they have a problem, and often it is that technical debt problem.
Ajey: Can I add one more thing on this? Another thing which I realize, a lot of time, people actually do not seek help. I'll tell you this: how do you define this behavior where I think I know everything? I think there is one more thing which is about overconfident technical debt, that's what it is. A lot of times, people don't seek help and I always encourage people to go ask people, tweet about it, put a post, put a blog, or reach out to your friends and colleagues. That will help you not to make many mistakes. When I talk to many, many companies I tell them, "Look, I might be a very small packet of success, but I'm cargo full of failure."
I can tell you 100 ways that things won't work. At least removing those options which won't work will allow you to move on to the options which may work, thus reducing technical debt to a great extent. So there may be three types of technical debt: one is reckless, second is innocent, and third is overconfident. That's how technical debt comes in. As long as it is innocent technical debt, it will get paid in a much, much better way, but if it's reckless or overconfident technical debt, that means you're making some one-way decisions, which will lead you to pay heavily; it will still ask you for ransom at some point.
Tim: It's interesting, sometimes we come to those companies and they'd reached series C or D or something, and we come in and look at the tech and it's 10,000 lines of failure and you're like, "How did you survive this long?" Then you're like, "You know what? Maybe this company is really smart," they got to the hyper-growth stage and got enough funding and brought outside help to solve it. They definitely were suffering with low morale of developers and then obviously the product, and those folks being very frustrated about it but it's an interesting question.
Rebecca: Well, when I think about it too from the perspective of how do you convince people to let you pay down the debt, that's often when I'm brought in to explain to the engineering manager, or the VP of engineering, this is why we have to do this. The advantage, if you let it get to the point where your systems are crashing and your customers are unhappy, is that you've got all kinds of evidence to say, this is the problem, the building is burning down, we have to do something about this. If you were trying to stop it before that happens, how do you find that point? It doesn't seem to me that it would be easy to figure out when I need to start addressing this. I'm curious how you spot that point in time.
Ajey: This is one of the most frequent questions I get asked by a lot of developers who we partner with. Their CPO has asked me, "How do I convince my CEO or business person to get me this break?" I tell them two things. One, look, at the end of the day, what is your true North Star metric? Get more orders, go faster, do this, do that, whatever it is: what are you trying to do? Can you align the work with that? A long time back, I think 10 or 15 years back, one of the stories I heard was about the question "Does this make the car go faster?" and there was an accountant who was trying to do something.
Then somebody asked, "How would this make the car go faster?" and the accountant said, "I'm putting in a better expense management system so drivers spend less time on finding expenses and get more time to practice. If they get more time to practice, then eventually the car may go faster." That's a very indirect way of aligning technical debt, or something like it, to success. That's exactly what we did at Gojek multiple times. Look, we need to improve our developer experience, which is a totally non-functional requirement. It's like there's no way we can align that, but we can if we go and say, this is what it will end up as.
A lot of the time, I tied our technical debt to money for the business people. It's monetary: this is the product impact, or this is the growth impact. Also, whenever you're paying the technical debt, there is no instant gratification. If technical debt accumulates over a period of time, paying off technical debt also rewards you over time. There is no instant debt. People need to understand this: if you go and solve technical debt, you pay it as an EMI, you pay it in installments, and the gratification is also long-term; it's not instant.
As long as you have these two things saying, align every effort to the money at the end of day, and align every gratification to the power of compounding over a period of time, then only you can measure it over the period of time, and then you will see the value of it. We did take firebreaks in Gojek once, so I'll tell you a little bit of Gojek. We launched around 16 products in nine months in 2016. That means you're launching around one and a half products every month, or two products every month, almost. We acquired a lot of technical debt all over the place. A few of it was deliberate and some of it was innocent technical debt.
We also had overconfident technical debt in some places. To put it in perspective, we were doing around 5,000 orders per day in 2015; by the end of 2016, we were doing around one and a half million orders per day. That scale was crazy. Given that scale, whatever makes more orders is the right thing to do. We accumulated a lot of technical debt, and in December 2016 we just said, let's take a firebreak, let's just stop everything and repay this. That's one of the things which never fully works: you can only achieve 50% of it, because there is no way you can stop a running engine.
It's like a running car: you are always on the track, you're refueling on the track, you are changing tires on the track, somehow you are doing everything on the track. That's what startups are, but taking pit stops is better. We started learning to take multiple pit stops on the way: stop for a short time, do one small thing, and then gain speed again. Firebreaks don't work, but taking pit stops works a lot. You need to have multiple pit stops, not one every lap but like 10 pit stops in one lap, where you constantly still have a sense of moving and not stopping, and over time you can remove pit stops as you get better. Basically, at the end of the day, what we realized is that aligning to business objectives, and waiting for the gratification over a period of time, makes a lot of sense.
Rebecca: I'm drawing inferences here that I want to make explicit. A pit stop is just a very short maintenance break, where you go in, you solve an isolated problem, and then go on. What is the scale of a firebreak? I'm assuming, we're going to pause, we're going to isolate, and then we're going to do a lot of work. Is this an iteration, is this a quarter?
Ajey: Yes, a firebreak is more than an iteration. Think about it this way: you have a fire in a jungle and this fire is reaching you. What do you do? A lot of strategies are in place, but what is the other strategy? You light a fire at the other end, and that fire goes towards the first fire and stops the rest of the jungle from burning. That means in a firebreak you're not trying to douse the fire; you ignite more, so you're creating more chaos, but in order to bring stability. It cannot be one iteration. It is close to four, five, six iterations.
Our first firebreak was two months long, and the second firebreak was about one month long. The second firebreak worked very well. The reason was that we took it during Christmas 2017, and we counted it as part of an iteration and not as a firebreak. Earlier, we counted iterations as part of firebreaks, but this time we counted the firebreak as part of the iteration. That is much easier to understand: the rest keeps going, and we're only fixing one part. Stopping and fixing everything in firebreaks rarely works if you have the engine humming all the time, but making firebreaks part of your iteration, so that you are only replacing a tire or only replacing a piston or something like that, makes much more sense. One of the things we learned is to make firebreaks part of iterations, not iterations part of firebreaks.
Rebecca: Thank you. Tim, do you think the justifications for when to start paying off technical debt, does that vary between being a startup and being a scale-up from what you've seen?
Tim: That's kind of a scale, right? I think that there are inflection points — what might happen is a team might know how to work around the sharp edges, because they created all the technical debt so they know how to avoid it. One inflection point is when a team is about to expand rapidly because often technical debt is felt in onboarding because you have to learn all the weirdness and complexities in the code.
Scaling or growing, I think that's one point. It could be scaling by headcount, and the same could be true when you're adding more customers and things like that, because there might be places which you haven't automated, which is okay because you are only adding a customer every so often. There's a point where [chuckles] it becomes a problem and the developers are spending most of their day actually doing that [laughs] instead of actually developing.
It's one of those things: any kind of scale just exponentially increases those problems and makes them felt more. If you're planning for some amount of increase and you're getting some funding associated with it, that's probably a point to actually look and dedicate some of that funding to improving your technical platform, I would imagine.
Ajey: Yes, makes sense actually. In a lot of ways, you can also look at some of the metrics which can actually help you assess and see what it takes to improve this metric's functional and nonfunctional performance. That's how you can actually get much more easy alignment to allocate some money to that.
Tim: Yes, that's the kind of thing. I feel like it's what you were saying: maybe in an enterprise setting, you don't always have to justify everything, but because resources are so constrained here, everything just has to be justified. I think what's important are those metrics, and I think it's good now that people are getting much more familiar with DevOps metrics and developer experience metrics, and it's not so unusual for the CTO and the whole exec team to understand those and to appreciate them. I think in the past, that was maybe more of a problem for developers to advocate for.
Rebecca: Go ahead, Ajey.
Ajey: I was saying on metrics, I have a very interesting experience. One of the things which we did at Gojek was we flipped the metrics. We said, "Developer metrics are business metrics, and business metrics are developer metrics." One of the OKRs, when we did an OKR session in Gojek, I remember 2018, 2019, one of the company metrics was uptime.
That actually helped us a lot. Whenever the business started talking about something, they asked developers and support managers, "Will this affect uptime?" We were so happy to see that awareness. "If it is going to affect the uptime, then don't do it right now. Let's prioritize it some other time." People started looking at each product feature that way. Then we started getting product features or business requests which were aligned with product engineering metrics as well. On the other hand, completed orders became a product engineering metric from the day I joined.
One day we did like one million orders; sometimes we did only 900K. Then developers would get very extreme: why is it 900K? Was there a holiday? Then we start figuring out things, oh, there was a holiday. Oh, this rain happened. By the way, Gojek is a motorcycle ride company. That means motorbikes cannot fly when there's rain. If there's rain then we'll have fewer orders.
Developers started worrying. We started plotting, we started looking at the weather forecast and so on, asking, will it be a good day to sell products or not? A lot of the time, when you make business metrics part of the developers' ecosystem and product engineering metrics part of the business ecosystem, people start caring about them and they start making sure that they do the right thing, which creates conveniences on both ends.
Tim: Yes, I do see that more now, that the technical metrics are part of the business strategy. I think another example is, if a company's really trying to hire high-quality engineers, then they have to monitor the developer experience and the engineering satisfaction and monitor the amount of friction in order to actually be able to retain those folks. I think there are companies for whom the business strategy is having these top-quality engineers, and therefore you have to track some of those DX metrics associated with it.
Rebecca: Yes, and you've both mentioned morale in various ways in this. I do think that is critical, because, particularly in the startup and scale-up environments where things are changing so rapidly, the frustration builds more quickly if you feel like you don't have the right tools to do your job, or there are fixable problems that would make life better.
I liked what you said earlier, Ajey, about how paying off the technical debt can provide, not just, okay, now we don't have to worry about the debt anymore, but that satisfaction, that gratification. I've seen that on some of the projects I've been involved with where they feel like they have hope again. Whereas there was the pit of despair and then all of a sudden, oh, wait a minute. There is light. There is hope. I think that's an important thing. Particularly given how hot the talent market is at the moment it makes business sense to worry about developer effectiveness.
Ajey: Yes, it does. I'll tell you: every high-growth or scale-up environment can be treated in two ways. One, it can be treated as a pit of despair, or second, it can be treated as passion, as energizing, as trying to get something done every day. It also depends on how you portray the problem. A lot of the time we portray the business problem as a constraint rather than as an excellent engineering problem: we are stuck over here and this is what our fate is.
Instead of that, you say, no, this is one of the most complex engineering problems we're solving. I have seen morale going down and going up, both ways. One of the biggest problems I have seen is not celebrating the small successes. If you don't celebrate the small successes while you are fighting the war, or whatever we want to call it, if we want to put as much negative connotation on it as we can, if you don't celebrate the daily wins, morale goes really, really bad. The first thing is that.
Second thing, what makes a lot of sense in this kind of world is that people need to understand that it's not a daily grind. A lot of the time people say, okay, we come, we fight, we die. We go home and we fix ourselves. Again, we come, we fight, we die. No, it's not like that. It's mostly like we are pushing something daily somewhere. One of the things which made a lot of sense for us was to place our dashboards all over the place in the organization. We had big monitors and placed our dashboards everywhere: how many rides we completed, how many tons of food we delivered. We used to have this metric, how many round trips to the moon we did today, because that is the cumulative kilometers of Gojek drivers. We had one million drivers. Even if a driver goes 40 kilometers a day, we're talking about 40 million kilometers a day. This is where a lot of engineering leaders need to understand that these things go hand in hand: small wins add up to a larger one. You need to understand how to build a balance between the two, and if you can keep doing that, then everybody will be happy and there is not that much despair in the air.
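[Illustration, not part of the recording] The dashboard arithmetic Ajey mentions can be worked through as a short Python sketch. The driver count and daily distance are the round figures from the conversation; the mean Earth-Moon distance of about 384,400 km is an added assumption, and the function name is hypothetical.

```python
MOON_DISTANCE_KM = 384_400  # mean Earth-Moon distance (assumption added for this sketch)


def moon_round_trips(drivers: int, km_per_driver_per_day: float) -> tuple[float, float]:
    """Return (total kilometers per day, equivalent round trips to the Moon)."""
    total_km = drivers * km_per_driver_per_day
    round_trip_km = 2 * MOON_DISTANCE_KM
    return total_km, total_km / round_trip_km


# Figures quoted in the conversation: one million drivers, 40 km each per day.
total_km, trips = moon_round_trips(1_000_000, 40)
print(f"{total_km:,.0f} km per day is about {trips:.0f} round trips to the Moon")
```

With those inputs the total is 40 million kilometers a day, which works out to roughly 52 round trips.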
Tim: I think there's an importance to transparency of information and strategy. I think sometimes, people get frustrated. There's nothing worse than having these kinds of technical problems and having them not recognized. It might be that it's recognized, but it's just not important at the moment. Sometimes I've seen engineers, or folks, just getting frustrated, and a lot of it is because the actual business strategy or product strategy hasn't been shared with them, and they say, "Well, we have to optimize this thing." We know, but that's not the focus at the moment.
I think sometimes, when that information is shared and the engineers internalize it, then they might be okay with certain areas having technical debt because it's not important right now. We have a client right now where we found a lot of scaling problems in the data pipeline and the team really wanted to fix them, but we're still trying to find product market fit. When we have enough customers that the data pipeline is a problem, that'd be a good problem to solve, but for the minute, let's focus on the features that customers care about.
Rebecca: I think that goes back, Ajey, to what you were saying about flipping the metrics — making sure that the development teams understand what the business is prioritizing at the moment, and then they can use that to inform the decisions that they're making around, "Okay, well, if this is what the business priority is, I better go take a look at that because that is going to have to change a lot if this is what our business priority is." I think those two ideas are related there.
We've talked a little bit about where technical debt comes from but what about kinds of technical debt?
I remember reading an analyst report years ago, which was talking about the technical debt that is associated with the version upgrades that haven't happened on your packages. That's clearly one source of technical debt. Another source of technical debt might be something algorithmic, like, Ajey, you were talking about, with the driver selection. Tim, what are some other kinds of technical debt that you've run across?
Tim: It's interesting: when you think about debt, you almost think it was intentional. Like we intentionally missed something and didn't automate something, or something like that. It's particularly a problem with start-ups because it comes about via perhaps building a design for something and then the strategy changes, and that design is no longer appropriate. We often examine or see startups where we feel like the architecture has been overfitted to a particular problem, it's been optimized for a particular problem but the actual business and product are still pivoting a little bit.
Then you end up with that problem that Ajey was talking about with the sunk cost, because you try to change it to fit the new paradigm. Sometimes that complexity comes from just the pivots.
The other thing that I've seen, and again, I think it comes about with scale-ups particularly, is that at a certain point a scale-up will just try to build a lot of features. When we examine a code base, we often find a lot of code that just isn't used, and it was because perhaps the start-up was maybe very sales-driven.
They were building a lot of features that you could put on a checklist but were actually never used. Also, perhaps a lot of edge cases.
Every developer knows that it's that edge case that disqualifies your model and forces you to change it. Sometimes what we see, if there isn't enough dialogue between tech and product, is that the tech team will go and create this elaborate architecture to handle an edge case that perhaps should have just been handled with some cheap script or something, rather than accommodated into the core model, but that conversation between product and tech wasn't happening enough. Yes, spending a lot of time on stuff that isn't really that important and doesn't get used that much, that's one source I see. Feature bloat I suppose you could call it, but yes.
Ajey: There's one more kind of technical debt, which I think a lot of the time comes from underestimating the accidental complexity. Think of essential versus accidental complexity when you're trying to use some SaaS or trying to bring in some third-party library. I'll use a simple example. Suppose you are trying to go to Starbucks or some nice coffee place from your hotel or office every day. The essential complexity is that you should know the way to get there.
The accidental complexity is that you have to drive on that road, and you do not have a bunch of other people's behavior under control. There is the accidental complexity of not being able to follow the traffic rules because somebody came and crashed into you, or you took a wrong turn because you don't have maps. If you have a proper map, then you can reduce this accidental complexity a great deal, and that comes from exposure and knowledge.
A lot of the time, people look at essential complexity and estimate it, but they never look at the maps or navigation which can actually reduce the accidental complexity over time. A lot of the time the technical debt comes because of this accidental complexity. I'll give you some examples, for example, maintaining the cache. People know why we're doing the cache, but they don't understand sometimes that few other cache servers have signal failure. That is accidental complexity they need to deal with when there are a lot of concurrent applications.
When you are looking at application-level sharding or data-level sharding, the essential complexity is that you have many shards. Accidental complexity is that you need to deal with the network layer, you have to deal with synchronization in order to deal with shards. You need to deal with multiple things, and that is accidental complexity.
What they do, they try to do monkey patching for that one side of parameters in capture, they will just try to fix availability and nothing else, and then everything else goes for a toss. That kind of technical debt is way, way more difficult to solve in the future. I always tell people: whenever you're looking at something as essential complexity, have you dug enough to find the accidental complexity associated with it? If you haven't, then please spend five more days, ten more days, 20 more days finding that and try to give me a solution for those things, instead of just implementing something blindly. That is one of the biggest sources of grief I had during my Gojek days.
Rebecca: We could talk about technical debt forever, but this has been great fun. Thank you, Tim. Thank you, Ajey, for your insights into various forms of technical debt and how startups and scale-ups can at least be a bit more deliberate about how they take advantage of technical debt. Thank you, Tim. Thank you, Ajey.
Neal Ford: Join us for the next episode of the Thoughtworks Technology Podcast where we talk to our colleagues Georgina and James about the basal cost of software development and maintenance. Hope to see you then.
[music]
[END OF AUDIO]