Brief summary
Global craft marketplace Etsy has grown at an impressive rate in recent years: from 2019 to 2021, its sales and revenue tripled. This growth was enabled by a significant technology modernization project which, remarkably, was completed just weeks before the coronavirus pandemic erupted in March 2020, the start of a period in which millions of people took to Etsy to purchase cloth face masks. Without the modernized systems and infrastructure, Etsy would have struggled to cope with consumer demand.
In this episode of the Technology Podcast, Mike Mason is joined by Etsy's Chief Architect Keyur Govande, the company's former CTO Mike Fisher and Thoughtworks North America Technical Director Tim Cochran to discuss how Etsy tackled the challenge of scaling to meet the needs of its expanding market. They talk through the technical challenges and the organizational focus required to scale in a way that was sustainable for the business yet impactful for Etsy's users.
- Read Tim Cochran's article about evolving Etsy's culture on martinfowler.com
- ...And his piece on using the cloud to scale
- Learn more about Thoughtworks' partnership with Etsy
Episode transcript
Mike Mason: Hello, and welcome to the Thoughtworks Technology podcast. My name is Mike Mason. I'm joined today by Tim Cochran, who is a tech director at Thoughtworks and he's currently running our scale-up studio. Hi, Tim.
Tim Cochran: Hi, Mike. I'm super excited to be here and talk about this topic.
Mike: Awesome. The topic today is scaling Etsy. Etsy — everybody knows them. They are a global destination for unique and creative goods. Their core challenge is that they have more than 100 million listings, and they need to connect 95 million buyers with 7.5 million sellers. They tripled their gross merchandise sales and their revenue from 2019 to 2021. As you can imagine, that's quite a steep growth curve. We're joined today by two folks. First of all, we have Keyur Govande, who is the Chief Architect and VP of Engineering at Etsy. Hello, Keyur.
Keyur Govande: Hey, Mike, how's it going?
Mike: Awesome. Thank you for being here today. We have Mike Fisher. We're going to call him Fish to avoid the confusion with me. Mike Fisher is the former CTO of Etsy. Hi, Mike.
Mike Fisher [Fish]: Hi, Mike. Thanks for having me.
Mike: Awesome. Scaling Etsy: we had a little taste of some of the numbers in terms of the amount of products sold and the revenue growth. I think we can all imagine that there's underlying infrastructure and technology growth to support that, but we'll start the story back in 2017. Fish, you joined Etsy as CTO in 2017. Can you tell us a bit about your approach to looking at scaling as you arrived in the organization?
Fish: Yes. I think our scaling story, at least for me, definitely starts in 2017. When I first arrived, there were a couple of things going on. One, we anticipated we wanted to be able to really reignite growth. That was in the minds of folks at that time. There were definitely some areas that we felt were holding us back, one of which was that we were still in data centers. As I think about it, I would call it tech debt in our data center architecture, in that we were in three data centers, two primary and one backup, but both of the two primary were required to run.
That had just happened over time, adding services and, of course, doing things quickly, which is very normal in a company that's growing very rapidly like Etsy was. Suddenly, one day you turn around and you're like, "I've got myself into something I've got to get out of." We certainly architected around that with the data centers and redistributed services and such, and then we looked at things like the timeline for acquiring hardware. One of the stories that I recall was we had a Hadoop cluster that was running at capacity, and so we put a new cluster on order.
I think it was about 400 nodes. That took, let's call it, four to six months to get the approvals for the hardware, get it in, and establish it. Within 24 hours, the new Hadoop cluster was at 100% utilization. You could clearly see this was holding back the teams. The teams wanted to move quickly, and this process we had and its timelines weren't really working for us. That's when we started thinking: let's get to the cloud, where we'd have practically unlimited capacity and the ability to move as quickly as our teams wanted to and, quite frankly, could. Our teams could move incredibly fast. That was the thinking around why, initially, we should start this migration to the cloud.
Mike: You worked with Keyur on that. The two of you, as I understand it, built this migration approach. Were you using any cloud services at the time? How did you start looking at that migration?
Fish: I actually think we were using a non-Google cloud service, maybe for storage, at the time, so we had some experience. I'm glad Keyur's here because I can't express enough how instrumental Keyur was in this migration. He's a very humble technologist. Not only is he a brilliant technologist and a super kind, caring mentor to tons of our engineers, but he also stepped up to lead this migration. I'm really glad he's here to talk with us. Then, and I think even Keyur would agree, we're here representing literally hundreds of people's worth of work; this was just an enormous team effort. I think at one point we had about 25% of our engineering capacity working on the migration in some way. Keyur was certainly instrumental in this, so I'm glad he's here to talk about it.
Keyur: Thank you for those kind words, Fish. Yes, I could not agree more in terms of how many people at Etsy played a part in executing this migration and making it successful. For us, like Fish said, we had some cloud usage, but our experience in the cloud was limited. We had existed in our data centers for the entirety of Etsy's journey until that moment. We had people who were cloud-curious, but we really had not leveraged the cloud in any meaningful way until we made that fateful decision in 2017.
Mike: Presumably there was some evaluation of cloud vendors. What kinds of things did you think about there?
Fish: One of the important things was partnership. With any vendor, you can spend a lot of time on contract negotiations, and we did, a lot of people do, trying to get all the terms and conditions down and think about all the things that could happen. At the end of the day, things come up that are outside of that. What really matters is how the vendor steps up to help out and provide the service and care, and all of that. That was, I would argue, at the very, very top of our list. Google happened to be a vendor that really wanted to partner with us.
They took the time to come in with actual engineers, not just sales reps, and learn about the way we worked and the things we cared about. One of those, which we can certainly talk more about, is sustainability: how important that was to us and how we thought about it slightly differently. They were open to that and really said, "We will work with you to come up with ways that satisfy your concerns around that." Those are the two highlights. I'll pass to Keyur; we ended up with a decision matrix of hundreds if not thousands of items, and he can probably talk about some of the concerns he had with vendors.
Keyur: Sure. I think there is something to be said for the GCP offering being, at that moment in time, a much more cohesive set of services. That was also on our mind: we were starting with nothing in the cloud, so we had a clean slate, and a more holistically designed offering was more attractive to us. If we had had a ton of preexisting cloud infrastructure, the decision matrix may have looked slightly different. Having nothing, it was easy for us to do an apples-to-apples comparison amongst the various vendors and see which was the strongest package as a whole. For us, Google won out.
Fish: I'm glad we've got Tim here as well, because Thoughtworks was another vendor and third party that was part of our journey. The same thing I just said about Google being a great partner applies: Thoughtworks has been along this journey with us from the very early days. I think Tim joined maybe two or three months after me, maybe not even that long, and was with us, him and the entire Thoughtworks team, through this journey. It's the same thing. They shared a lot of our culture, and they had a strong engineering team. All of this really aligned. Certainly technical skills and competencies matter, but it's also really, really important that we feel comfortable with someone who shares our mission, purpose, and values. Thoughtworks definitely hit the mark on that.
Mike: Awesome. Thanks for saying that, Fish. Just to come back to sustainability for a minute: that's an interesting topic today more than ever, but you folks were thinking about it back in 2017. Was that a concern about cloud emissions, that kind of thing? Tell us a bit more. It sounds like it also connects a little with Etsy's mission as a company.
Fish: Yes, it absolutely does. Keyur can talk about how much, even pre-2017, the company cared and the efforts that went into sustainability, not just for compute but for the office space, actual commuting, and things like that. There's a pretty cool story about offsetting the shipping. This happened several years into my tenure at Etsy. The company looked around and said, "One of the things that creates the most emissions is actually shipping our products from the sellers to the buyers.
How can we help offset that and do the right thing as a company?" We looked into it and said it's going to cost a little over a penny per shipment to do this, to buy carbon offsets.
We said, "Let's just do it. It makes sense. It's the right thing even though it's going to cost the company, it's just a pure cost." We did it, and we ended up putting a little banner on the checkout page that said, "Etsy is going to pay the cost for offsetting this. Enough people saw the banner and cared about it to increase conversion enough to pay for that program."
That to me is just exactly the type of company that Etsy is: they didn't go into it thinking, "Yes, this is going to be profitable for us." They went into it thinking, "This is the right thing to do for what we believe in." It turned out absolutely great, not only for the planet but for the business as well, which is just a huge win-win. I think that's the type of mindset we took into the cloud. One of the big concerns was losing visibility into the energy consumption for compute and storage and everything we had in the data centers. In the data centers, you're often billed by that, and so you have that exact data.
We were concerned about that being obfuscated in the cloud, and so, after the migration, a team took it upon themselves to come up with a concept called Cloud Jewels, a measurement that tries to estimate closely how much energy we would be consuming, so that we can work both to reduce it and to offset it.
That's another case where we're really proud to partner with Google, because Google could provide some visibility and share the way we were thinking with other customers, which allowed us to have a much bigger impact than just Etsy.
I would say exactly the same thing about Thoughtworks: they took this concept and turned around and created it not just for Google's cloud, GCP, but for other cloud providers. That also allowed us to amplify our voice and our concerns. Yet again, just another example of the partnership we saw both with Thoughtworks and Google on something we cared about a lot.
Mike: I know, Tim, you've been somewhat involved with that. Can you briefly tell us a little of that story?
Tim: Yes, I can a little bit. Etsy were very generous, and they published a lot of how they were calculating the carbon footprint. We have a sustainability program at Thoughtworks, and we wanted to do a similar thing, so we collaborated and essentially took those algorithms and put them into a Node library for calculating a carbon footprint. It integrates with the cloud. Since then, as Fish said, we've onboarded different clouds, and actually, now, it's part of a Backstage plugin. Anybody that's using Backstage can use the goodness that Etsy came up with.
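[Editor's note: for readers curious what a Cloud Jewels-style calculation looks like, here is a minimal TypeScript sketch: take usage figures from cloud billing data, convert them to watt-hours with per-unit energy coefficients, then to emissions with a grid carbon-intensity factor. The coefficients, constants, and names below are illustrative assumptions, not the published Cloud Jewels or Cloud Carbon Footprint values.]

```typescript
// Illustrative sketch of a Cloud Jewels-style energy/emissions estimate.
// All coefficients are assumptions for the example, not published figures.

interface UsageRecord {
  vcpuHours: number;        // compute usage, from cloud billing exports
  ssdTerabyteHours: number; // SSD storage usage
}

// Assumed energy coefficients (watt-hours per unit of usage).
const WH_PER_VCPU_HOUR = 2.1;
const WH_PER_SSD_TB_HOUR = 1.5;

// Assumed grid carbon intensity (grams CO2e per kWh).
const GRAMS_CO2E_PER_KWH = 475;

function estimateGramsCo2e(usage: UsageRecord): number {
  // Convert usage into energy, then energy into emissions.
  const wattHours =
    usage.vcpuHours * WH_PER_VCPU_HOUR +
    usage.ssdTerabyteHours * WH_PER_SSD_TB_HOUR;
  return (wattHours / 1000) * GRAMS_CO2E_PER_KWH;
}

// Example: one service's monthly usage.
console.log(estimateGramsCo2e({ vcpuHours: 720000, ssdTerabyteHours: 5000 }));
```

The real libraries refine this with per-provider, per-region coefficients and many more usage types, but the shape of the calculation is the same.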
Mike: I also think it's interesting to see that that kind of thing is now becoming a first-class part of most of the cloud vendors' dashboards as well. To me, there's a direct line from that story to having an industry influence, so that everybody gets that. Just on the migration a little bit more: when we were talking before the recording, we talked a little about lift and shift versus making something cloud-native and then migrating it. Keyur, you were the brains of the operation, as I understand it. Can you tell us a little about the trade-off between those two and the things that you learned?
Keyur: Sure. Just to set context for everybody: Etsy chose both paths. We lifted and shifted some things, and we migrated other things in a cloud-native manner. The decisions were very much based on business requirements. As Fish alluded to earlier, we had a business need to make sure that our website, the user-facing side, could scale up with the growth we were hoping the business would generate. That was critical. We did not want to invest more in buying hardware in the data center to do that, because we knew we were getting out.
That was our highest priority. In that case, we prioritized speed, and we moved the Etsy Web side of the house, as we call it, the PHP LAMP-based architecture, in a lift-and-shift manner. One other reason for picking that was the expertise on the team; the team felt a lot more comfortable with primitives in the cloud that were similar to those in the data center. On the flip side, we migrated our search engine in a much more cloud-native manner. That team was already well on the path towards Kubernetes, and Google came along and offered a fully managed Kubernetes offering in GKE.
We ended up prioritizing migrating to Kubernetes, hosted first in the data center. Moving the LAMP stack up created spare capacity, so we were able to give the search team as much capacity as they needed in the DC to migrate to Kubernetes. Then we moved from a self-hosted to a managed offering in the cloud as a second step. The interesting trade-off was that we gave up a little bit of cloud learning in pulling this migration off, because the primitives we were familiar with in Kubernetes in the data center were not always available to us in the cloud.
The failure modes were different. Networking was not as reliable. There was a learning curve for our search team and the SREs there. That's a thing to keep in mind: if you're wholesale rearchitecting your application, that is a risk you take on, so think about it and go in with your eyes wide open.
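[Editor's note: as a rough illustration of the managed offering Keyur mentions, a GKE cluster can be declared in a few lines of Terraform, with Google operating the Kubernetes control plane. The names and sizes below are invented for illustration and are not Etsy's configuration.]

```hcl
# Hypothetical sketch: a managed GKE cluster in Terraform.
# Google runs the control plane; we only declare the node pool.
resource "google_container_cluster" "search" {
  name     = "search-cluster"
  location = "us-central1"

  # Use our own node pool below instead of the default one.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "search_nodes" {
  name       = "search-nodes"
  location   = "us-central1"
  cluster    = google_container_cluster.search.name
  node_count = 30

  node_config {
    machine_type = "n1-highmem-8"
  }
}
```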
Tim: What was interesting to me was the way that Etsy approached the cloud migration, which I think to you was just the logical way you approach all your software problems: how do we do a test to make sure that it works? It really felt like you were applying an evolutionary approach, maybe you might even say an MVP approach: can we actually get the website running with the very basics? The other thing I found quite interesting was that you gave a lot of control to individual teams to decide, for the capabilities they owned, how they wanted to progress them to the cloud and what made sense based on the maturity of SaaS and cloud services. That was interesting, but I suspect that probably just came out as logical to you guys.
Fish: I do think the autonomy of teams is something that's just part of the Etsy culture, and it's important that it's continued. It's how all of the product teams, infra teams, and enablement teams still run today: trying to push decisions down to the lowest level possible. They're the ones who have the right information, and they're the ones who can make the best decision. Then we do a lot of trying to share that knowledge, as Keyur was talking about. There are pros and cons to both approaches. If one team learns something about the network or a managed service or something, we try to share that with other teams.
When Keyur was talking about the migration for Etsy Web, the main marketplace, it reminded me of the story of that. He alluded to the fact that we didn't want to have to buy new hardware for that year, because the hardware process (purchasing, getting it made depending on the order, shipping, racking and stacking, implementing everything) was really months. We have a very busy holiday season, so the migration had to start no later than, I think, the third week of August, something like that. We'd signed the contract in December of 2017, so when the team initially looked at this, they said, "It'll take a year, but that means we're going to have to buy--"
At the time it was like $5 million, $6 million worth of hardware just for Etsy Web to run through the holiday season. Then they looked at it again and said, "There's a chance we can do this faster." The team really doubled down and said if we could migrate Etsy Web, the marketplace, by August, we could save all that money and not have to buy the hardware.
That came to be the plan. We started, I think it was the second week of August, on a Sunday night, migrating. We got everything migrated over and started taking internal transactions to test.
Then we ran into a problem and ultimately rolled back that evening. Great success, but not quite there. Then we shook it off. I tried to send people home after they'd been up for 24 hours; they wouldn't have it. They were like, "We're solving this." They had it solved within hours, and so we said, "Okay, let's do it again." That Tuesday night, when we were days away from just pulling the plug for the season, ordering the hardware, and staying in the data center through the holiday season, we did it again. Again, all night migrating, and we got down to minutes again, with, of course, as these things go, finding errors and fixing stuff.
Early that morning in August, we started taking live transactions. We were in the cloud, and we've been there ever since with the marketplace. We avoided having to purchase the hardware, avoided all of that wastefulness, and the team just did an amazing job. My point in telling that story is that it's not a straight path. There were a lot of curves and bends to it. The team is just such an amazingly strong engineering team that once they got it in their mind, they were heads down and relentless, and did an amazing job.
Keyur: Fish's story reminded me that, unusually for tech companies, we did take a maintenance window in order to execute the cutover. The site was down for four hours on Attempt 1 and four hours on Attempt 2, so eight hours of downtime in total. I think it was the right decision because it de-risked the effort a ton. Nobody was having bad experiences. We had put up a very nice-looking banner saying, "Hey, we're doing something important. We'll be back. Trust us." The other thing I just remembered is that, because we were working with such a tight timeline, we ended up making some decisions along the way that aligned with the skillset of our engineering team and what we could support operationally.
I think we realized that we were cutting over to the cloud right before our busiest season, and we wanted our SREs to have all the information, everything that they needed in order to successfully execute Black Friday, Cyber Monday, our peak days of the year, the things that we prepare for all year long. This would be in a new environment. I think that focus of, we need to do this, but we also need to do this in a way that things will work on game day, was a good prompt for the team in terms of how to make decisions and what to prioritize.
Mike: The move to cloud, you already had a strong engineering culture, and I think the culture story is an important part of this, and maybe it's worth digging into briefly, what is the engineering culture at Etsy, and how does that affect scaling? Also, how did the shift to cloud help or change what you're able to do with the engineering culture at the firm?
Fish: I can start. I joined in 2017, but I actually first met Etsy in 2008, early in their journey. Because they were already growing so quickly, they had some scale challenges; that's how I, as a consultant, got brought in. Almost from day one, they had this super strong engineering culture. Then, with the leaders that came and went over the years, it just got stronger. Everything from Code as Craft, the blog, the speaker series, to all the open-source projects the teams have done: really just a super strong engineering culture.
The way I would describe it, the way I would talk to people about it, is that our mission is to keep commerce human. You see that every day on the marketplace, where Etsy connects one buyer with one seller at a time. In our modern day of mass commoditization, you don't always get that. That is just the core of what we do, but if we only did that on the marketplace, I don't think we'd be fulfilling that mission. Keeping commerce human also means keeping it human within the company. Other departments do this too, but engineering is really where I witnessed people living and working this way, treating each other as humans.
People are going to have good days and bad days; they're going to have skills, and they're going to have development areas. People are going to need help. I've witnessed so many times people drop everything to help someone else, and it's because we treat each other as humans. We care about each other. That is the culture that pervades all of Etsy engineering, this human aspect, which brings out all the wonderful things you would expect of a caring culture: the blameless postmortems, the acceptance that people are going to need development areas, coaching, mentoring, and all that. It's great that we have the super strong technologists that we do, but really, I think the superpower is that caring culture of treating each other as humans.
Tim: I can tell a story about that which impressed me as someone working in your environment, and I think it speaks a little to the learning and blameless postmortems as well. We were all learning at the time; I think we were all moving to infrastructure-as-code and Terraform. One of the Thoughtworkers had made a mistake in the Terraform and pushed it to production. It didn't actually cause an outage, but it caused some degradation of search. What was interesting about this was the reaction. First off, because at Etsy everybody helps each other, there was this immediate effort to solve the problem.
The problem was solved in seconds; it was just an operational toggle that we could change. What was most interesting was afterwards. As a consultant, normally when you screw up a little bit, you get called into the headmaster's office the next day. This time, someone from leadership on the platform engineering team pinged me, and I was like, "Oh, here we go." They actually asked me, "Is your consultant doing okay?" That was a very surprising reaction, and I think it speaks to the culture of Etsy and the appreciation that everything is a learning opportunity.
Then, in the postmortem, we discussed that ultimately the Terraform didn't have any safety checks on it, and how we could improve the validation. It became a useful conversation instead of a blameful one. I think it was an interesting story.
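[Editor's note: one concrete form the safety checks discussed here can take is Terraform's variable validation, which rejects bad values at plan time. The variable name and bounds below are invented for illustration, not Etsy's actual configuration.]

```hcl
# Hypothetical guardrail: refuse obviously wrong values before apply.
variable "search_pool_size" {
  type        = number
  description = "Number of nodes in the search serving pool."

  validation {
    condition     = var.search_pool_size >= 3 && var.search_pool_size <= 100
    error_message = "search_pool_size must be between 3 and 100 nodes."
  }
}
```

Run as part of `terraform plan` in CI, a check like this turns a fat-fingered value into a failed build rather than a production incident.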
Mike: I think you mentioned some of the other stuff that you've been doing. There was something about product thinking as applied to infrastructure. Everybody talks about everything as a product these days, and I'm not sure everybody really gets it. Actually, it can mean a lot of different things in a lot of different contexts. Keyur, can you tell us a little bit about product thinking at Etsy in the technology department?
Keyur: Sure. I think what you're getting at is what we internally call our enablement layer. For people who may not be aware, Etsy is one large monolith, our PHP application; it is deployed as a monolith and works as a monolith, and then we have a bunch of other supporting services. What we found was that not everything we needed was available from our cloud vendors or as a managed offering we could purchase. There were some things core to our business that we needed to build ourselves. Approaching those as products, as platforms that customers (and when I say customers, I mean other Etsy engineers) are going to use, was really, really important, because we didn't want to build an Etsy... [trails off] One of the ethos that has persisted over the years is that we don't want to do technology for technology's sake. We want to be solving business-critical problems, problems that are hindering something.
It could be engineers' happiness, their productivity, or even a business goal. Bringing that thinking into how we prioritize and what we work on in our internal platform teams is the clearest manifestation of the product thinking you were just describing. Currently, we have researchers and product managers who focus on our highest-impact, highest-ROI internal platforms, to make sure we apply the same rigor to the things we build internally as we do to our customer-facing, end-user-facing products: etsy.com and the apps.
The last point I'll make is that, through Fish's tenure, his perspective was a very good, what's the word, checklist for what we should be building internally. The way he would phrase it, and Fish, correct me if I'm wrong, was: we don't want to build something that's a commodity; we would much rather pay somebody else to do that. We want to build the things that are additive to our business. By using that rubric, we don't have a ton of internal platforms, but the ones we do have, we have strong conviction, are force multipliers for our business, so that helps. Again, we're not trying to do too many things at once, but where we do build, we have all the support we need to make sure those platforms are successful.
Mike: One of the things that we sometimes see is that people make a good decision about building something one day because they've looked around and there isn't anything in the market. There isn't an open-source tool, and so it is appropriate to build something, but then a couple of years go by, and you're still maintaining that internal thing. People have a little bit of an emotional attachment to it, some pride in the thing that they've built. In some cases, some people don't end up retiring them at all. Was that something that you ran into and you needed to tackle?
Keyur: Yes, that happens, but I think what helps is periodically revisiting all decisions. I think to use something Fish says, we try to avoid as many one-way doors as possible at Etsy. Every time we make one of those one-way door decisions, that has a ton of scrutiny, but most engineering decisions are two-way doors. Periodically, either through a contract renewal or a re-architecture conversation, we're happy to open those up. I won't list any names, but I can say confidently that over the years, we have migrated a bunch of internal services that are now available as managed offerings.
Especially in the observability space, I think we've gotten a ton of value out of that for the company, because we don't have to manage the large infrastructure that's necessary at our scale. We would much rather somebody else do that for us, and we get the value out of building the alerts, the dashboards, and the monitors that we need. By constantly reevaluating some of these old decisions, I think we've been able to avoid stagnating in that way.
Fish: Yes, I completely agree that it's very normal. You make a decision one day, then you get heads down on the next problem to solve. All businesses have challenges; there are constantly things going on that you're working on, and it's easy to let years slip by without re-evaluating. I think we've gotten pretty good at pausing and periodically reevaluating. Sometimes it's on a more frequent schedule, every six months to a year; sometimes it's multi-year. We're always thinking about what I call moving up the stack.
I mean that only in that you're getting closer to the customer. Not that there's any more value, but we want our engineers as close to the customers' problems as possible so that we're solving, directly, what is most important for them. In some cases, you may be the world's best at building the database, and that's really important to your customers if you happen to sell databases for a living. If you don't, there's probably someone else that could do that for you, and that frees you up and your engineering time up to go closer, like Keyur was saying — build on top of that database.
Whether that's using someone else's logging and observability so that we can build the alerts on top of it, or someone else's service so that we can build features on top, we're constantly thinking about how we offload the stuff our customers care less about and get towards the stuff they want more of. Sometimes you can't; sometimes there's not a vendor out there. Etsy was certainly early enough that they had a lot of that, where they had to build stuff themselves. I think we've been good at keeping an eye on that and saying, "Okay, that was the right decision."
That was completely the right decision. We're not saying it was the wrong decision, but times change. Technology moves very quickly. I think today the biggest shift most people would point to is around machine learning and AI. Just a couple of years ago, we were standing up our own infrastructure because it wasn't there. Fast forward, and there's infrastructure within the clouds, but we were building a lot of our own models. Fast forward to today, and we could use some of these models for computer vision and such, models that didn't even exist just a couple of years ago. It was the right decision then, but now we're reevaluating, and it's moving so quickly, as everyone knows, that you've got to stay on top of it to make those timely decisions.
Mike: That's really interesting as well. Do you have a sense of when you made those shifts, how much, I don't know, effort, cognitive load, it freed up your teams from managing a thing yourself to using a cloud service? Do you have a sense of that?
Keyur: I think Fish has a number for the overall cloud migration, which is pretty impressive; I'll let him speak to that. But just to call out the trade-offs: when you do use a third party, you lose internal visibility into the system. You must be aware that you're now basically reduced to opening a ticket with this third party saying, "Hey, I need help. Something is wrong." In terms of efficiency gained, that one is a little tougher to map, because the work changes shape. Instead of managing infrastructure, for example in the observability space, we've instead, in Fish's parlance, moved up the stack.
We moved closer to the product engineers who are leveraging these observability tools. Instead of being infrastructure SREs managing a ginormous cluster of logging machines, we're now a team of observability consultants who help our engineers make sure their code is properly instrumented and has all the right tracing and metrics they might need in order to troubleshoot and debug. I wouldn't say there was any efficiency gained from a people standpoint. It was just that the work changed; it morphed into something we thought was more valuable to us, which is this latter piece.
Fish: Yes. In large numbers, pre-migration we were spending about 60% of our engineering effort on infrastructure and about 40% on product development. Post-migration, that number flipped, which was a great change. But if you then break down the infrastructure number, which is about 40%, it gets even better. As Keyur was saying, within that, only 6% of our engineering effort goes to what we consider pure infrastructure. The other roughly 34% is focused on enablement, which is helping our product engineers.
The story is really, really compelling. If you just look at the big numbers, that's great: we shifted a majority of our engineers towards product, which is great for our customers. Then if you look at the infrastructure numbers, it gets even better because, as Keyur was saying, they're now not managing these really low-level infrastructure pieces; they're working on the stuff above that, which enables our product engineers to move faster and have better experiences. I think the real primary purpose of the migration was to get the engineers as far up the stack as possible, whether their customers are other engineers or our buyers and sellers. It's been a big success from that perspective.
Mike: I think just to help our listeners understand maybe a little bit of the punchline of the story. This was work that you began in 2017, migrating to the cloud, setting yourself up for scalability. Then none of us realized it was coming, but the pandemic hit. Can you tell us about that moment and what happened?
Fish: Yes, you're right, we started in 2017 knowing that we wanted to grow quickly, but not really having a sense of how fast we could grow. We were very fortunate in when we declared victory on our migration. Everything was migrated in February of 2022. One month later, March of 2022, COVID happened. We went into lockdown. We collectively held our breath for a second. Quickly, the CDC gave guidance that cloth masks should be used if there were no alternatives, because all of the protective gear needed to go to our healthcare workers and frontline workers, and so, overnight, people turned to Etsy.
An interesting story about what happened: if you had gone to Etsy, say, March 1st of 2022, and searched for face masks, you would've probably gotten either a deep cleansing scrub for your face or a Halloween mask. You would not have gotten a protective piece of cloth that you could wear. There were two big issues with that. One was on the sellers' side. We might have had three sellers at the time selling these protective masks, and demand was way outpacing that, so we had to reach out to sellers that had the capability, the skills, to make masks and say, "All right, please stop making what you're making and pivot. We have a huge demand."
That was a big effort. Then, on the engineering side, we had to retrain search. By this point, we were using machine learning in our search algorithms, and, with humans providing the training data, we took a bunch of people and had them classify images from these mask product listings. We fed that into our machine learning models overnight, and by the next day, the search results were where we wanted them to be, pointing towards protective masks. Then we eventually had the issue of overwhelming our sellers with orders. We had to work on engineering to be able to rotate sellers in and out of search, because you could be in the top search results and get 1,000 orders in a minute or in a couple of minutes. It was-
Mike: Wow.
Fish: -happening that quickly. Back to your initial question: that doubled our traffic overnight. Having, very fortunately, finished the migration just a month before, we were able to scale, and our infrastructure and our teams didn't miss a beat. We didn't experience any downtime from that. Everything scaled just as we had planned, and though we didn't know it at the time, it worked out really, really well for the company and for us. Sometimes it's better to be lucky than good, I guess.
Mike: [beginning of aside] Attentive listeners will notice that Fish misspoke and referred to 2022 when he actually meant 2020, the depths of the pandemic, when Etsy's traffic took off. He was in such a good flow that we didn't want to interrupt him to correct it. It was 2020 when everything took off. Thanks. [end of aside] There is always a question about moving to the cloud from an on-prem data center and whether you will get the benefit, and all of that. I think you've already talked about the non-monetary benefits in terms of scaling, but that particular event, doubling your traffic: you just couldn't have done it on traditional infrastructure. You would've fallen over, right?
Fish: 100% we would have fallen over.
Keyur: If I may interject, there is another angle here that we didn't touch on that might be interesting to listeners, which is that we were able to double in the cloud with, I'll simplify and say, two lines of code. We changed the Terraform config to say this pool of machines goes from whatever, X, to 2X, hit Apply, and we were up and running at 2X capacity. In the data center, one of the things that would've been challenging, leaving aside all the procurement, is that that level of API support, to stitch everything together in order to provision machines, is lacking.
Etsy, for many years while we were still on the data center journey, was on the path to inventing that internally. I think that is a value addition of the cloud that doesn't get talked about enough: all the APIs put together to manage your infrastructure are, I think, one of the key wins that come from migrating to the cloud.
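[Editor's note: a minimal sketch of the kind of change Keyur describes, assuming a managed instance group on GCP. The resource names, machine types, and sizes are invented for illustration, not Etsy's actual configuration.]

```hcl
# Hypothetical instance template for a pool of web machines.
resource "google_compute_instance_template" "web" {
  name_prefix  = "web-"
  machine_type = "n1-standard-8"

  disk {
    source_image = "debian-cloud/debian-11"
  }

  network_interface {
    network = "default"
  }
}

# Hypothetical managed pool built from that template.
resource "google_compute_instance_group_manager" "web_pool" {
  name               = "web-pool"
  base_instance_name = "web"
  zone               = "us-central1-a"

  version {
    instance_template = google_compute_instance_template.web.id
  }

  # Doubling capacity is a one-line edit (e.g. 200 -> 400),
  # followed by `terraform apply`.
  target_size = 400
}
```

In a data center, the same "X to 2X" change means procurement, shipping, racking, and imaging; here it is one field and an API call.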
Tim: In addition, we've basically been talking about technology and the cloud, but I know that scaling Etsy was also about people, product, and process, which we wrote about in the article. Fish, do you want to talk a little about what Etsy did around improving product delivery to help scale as well?
Fish: Yes, absolutely. You're exactly right. When I think about scaling an engineering organization, I think about people, process, and technology. You've got to constantly be thinking about each of those. I visualize them as dials that you're turning, sometimes two at a time, sometimes one at a time. You're always thinking about how you need to improve on each. It's a balance. If you scale up your technology too much too quickly, it's a waste of money. As Keyur was saying, if we had doubled our infrastructure before the demand was there, which we would've had to do if we were ordering hardware, it would have been a waste of money.
If you can do it just in time, it's perfect. The same thing with process: if you put in more process than you need at the time, you create bureaucracy, and nobody wants that. We want just the right amount of process for the maturity of the company, the size, the regulatory compliance, all of that. One of the things we thought about was how to scale our process as we were growing the engineering team. This is another example of how we partnered with a trusted vendor, in this case Thoughtworks, to help us with that.
The team, as I mentioned, shared so much of our culture. We actually had some former Thoughtworkers on our staff already, because of the culture and the many similarities around the strong engineering emphasis, so they felt right at home. When we brought Thoughtworks back in to help us with the migration, we also realized the next phase was that we needed help with the process, and so we went through a period of examination, of thinking about Agile. I'll say something a little controversial; I think I've said this publicly before, though.
Agile is a failed methodology. I say that because people treat it as if, I just do a standup or a burndown or something, I'm agile. That's not what was intended. If we go back to the Agile Manifesto, it's a set of principles. I think Thoughtworks agreed with us on the core of those principles: that is what really makes a team agile, and that's important. We spent many months together working through this, and eventually what came out of it was what we called a product development culture. There's the famous saying that culture trumps strategy; culture also trumps process.
Instead of dictating that every team is going to do this flavor of Agile or that flavor, we said, "These are the principles. This is what you've got to work towards." Then we gave them a bunch of different tools they could use. We said, "This is really about the culture. These are what you've got to follow." I think that, along with the migration to the cloud, was just as important in allowing us to scale up the engineering teams and deliver great product for our buyers and sellers. What we call the PDC was that important.
We actually looked at it again just last year and decided we're calling it our product development principles, getting really back to the core of Agile. It's the same concept that Thoughtworks helped us put in place five years ago, and it's been at the core of how we work every single day: this culture and these principles.
Mike: Okay. I think on that note I will say thank you to our guests here, Tim, Mike, and Keyur. Thank you so much for being on the podcast today. Super interesting story about scaling Etsy. There are a couple of articles that listeners can go look at for more info. They're on the martinfowler.com website, and we'll link to them in the show notes. If you're listening and you've enjoyed the podcast, please give us a thumbs up, a review, or a five-star rating on whatever podcast app you're using to listen to this because that really does help us broaden our reach and help people find the podcast. Again, thank you so much, Mike, Keyur and Tim.
Tim: Thanks.
Keyur: Thank you for having us.
[Music]
[END OF AUDIO]