Brief summary
The release of DeepSeek's AI models at the end of January 2025 sent shockwaves around the world. The weeks that followed have been rife with hype and rumor, ranging from suggestions that DeepSeek has completely upended the tech industry to claims that the efficiency gains ostensibly unlocked by DeepSeek are exaggerated. So, what's the reality? And what does it all really mean for the tech industry?
In this episode of the Technology Podcast, two of Thoughtworks' AI leaders — Prasanna Pendse (Global Director of AI Strategy) and Shayan Mohanty (Head of AI Research) — join hosts Prem Chandrasekaran and Ken Mugrage to provide a much-needed clear and sober perspective on DeepSeek. They dig into some of the technical details and discuss how the DeepSeek team was able to optimize the limited hardware at its disposal, and think through what the implications might be for the industry in the months to come.
Read Prasanna's take on DeepSeek on the Thoughtworks blog.
Episode transcript
Prem Chandrasekaran: Welcome, everyone, to yet another edition of the Thoughtworks Technology Podcast. Today, we are going to be talking about DeepSeek and everything that is related to it, the implications, what it means for us going forward, and how it seems to have changed the world. I've got Ken Mugrage, who is my co-host with me. Ken, do you quickly want to introduce yourself?
Ken Mugrage: Hi, everybody, this is Ken, one of the regular hosts of your podcast. Thanks for joining us.
Prem: We've also got Shayan and Prasanna who are going to be talking about this really, really fun topic. Shayan, do you want to quickly introduce yourself?
Shayan Mohanty: Yes, I'm Shayan. I'm the Head of AI Research at Thoughtworks, previously CEO and co-founder of Watchful. Good to be here.
Prem: How about you Prasanna?
Prasanna Pendse: I'm Prasanna. I look after the AI strategy for Thoughtworks.
Prem: Very warm welcome to both of you. It's wonderful to have you here. We are looking forward to a great conversation. Today, we are going to be talking about DeepSeek. What exactly is DeepSeek, and why is it that it's relevant and why are we talking about it?
Prasanna: DeepSeek is a startup in China that launched a model that they claimed performs as well as an o1 in certain reasoning tasks. One claim in their paper, the V3 paper, talks about the cost that it took to train this at about $5.6 million, which got taken out of context and spread like wildfire. Everybody now assumes that anyone can start from scratch and train a model under $6 million and beat OpenAI. That's why it became extremely popular.
At the same time, the app works well, people started using it, and it's free. You don't have to pay $200 a month for the pro mode as you do with OpenAI. People have started adopting it, and the app climbed the Apple App Store charts because adoption was so fast. That has also attracted attention from the media, and everybody started talking about it. Then, of course, this is sitting in the middle of a geopolitical situation that creates some additional heat, if not light.
Prem: Anything you want to add to that, Shayan?
Shayan: No, I think that basically does it. Again, it's worth noting that DeepSeek didn't make any particularly outlandish claims in their papers. I think a lot of this was taken out of context by people who might have skimmed the papers, latched onto specific talking points and then spread those. I think it's good that we're taking a more critical view of this whole thing and actually looking at it for what it is.
Prem: Right. Many seem to be comparing DeepSeek, obviously, with companies like OpenAI, Anthropic, Google, and so on. On the one hand, each of these companies provides commercial offerings, and then there are also companies like Meta and Ai2, for example, who offer a bunch of open-source offerings. If you were to compare both of these, how would you do it? Where does DeepSeek sit in the context of both kinds of solutions?
Prasanna: DeepSeek's model is post-trained on top of both Llama and Qwen. Qwen is Alibaba's open-weight model, I should say, and Llama is Meta's open-weight model. Whatever was done to train Llama and Qwen is a prerequisite for DeepSeek to get started. That's the first step. DeepSeek specifically has multiple models. The one that is getting attention is the R1 model, the reasoning aspect of R1. At one point we were saying that AIs cannot reason and all of that; now we're seeing multiple reasoning models coming out from OpenAI as well as from DeepSeek.
I think that's what's a little bit different: it sits on top of that. They apply a suite of techniques in the post-training space. It's not a model pre-trained from scratch, like Llama is, and they do some very interesting, very cool optimizations. That's probably the biggest thing they've done differently: they were restricted to the H800s, which imposed certain limitations, and they optimized everything from top to bottom to meet the constraint imposed upon them. They've done a really amazing job, I think, of extracting as much performance as possible out of whatever they had available.
Ken: Prasanna, you mentioned the restricted chips. Can you just briefly, for the audience that's not familiar, describe what you mean by that?
Prasanna: Yes. The US government at some point introduced export controls to prevent the export of NVIDIA's H100s and other similar chips to China. NVIDIA released a version of the H100 called the H800, which essentially is restricted to about half the GPU-to-GPU interconnect bandwidth, as well as a couple of other limitations. I think it has a smaller memory, VRAM, available, and things like that. Those are the chips that are available in China, and that's the hardware that they optimized for.
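To make that bandwidth limitation concrete, here is a rough back-of-the-envelope sketch in Python. The spec figures are approximate public numbers (not from the episode) and the payload size is purely hypothetical; the only point is that halving interconnect bandwidth roughly doubles the time it takes to move the same data between GPUs.

```python
# Illustrative only: approximate public spec figures, not numbers from the episode.
H100_NVLINK_GB_PER_S = 900  # approx. NVLink bandwidth of an H100 SXM
H800_NVLINK_GB_PER_S = 400  # approx. NVLink bandwidth of the export-restricted H800

payload_gb = 10  # hypothetical gradient/activation exchange between two GPUs

for name, bw in [("H100", H100_NVLINK_GB_PER_S), ("H800", H800_NVLINK_GB_PER_S)]:
    print(f"{name}: {payload_gb / bw * 1000:.1f} ms to move {payload_gb} GB between GPUs")
```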
Ken: One of the claims I read is that DeepSeek, in working around that, for lack of a better way to put it, actually found it to be an advantage in some way. Was that an accurate reading, or was that somebody pulling a snippet out of the paper like Shayan mentioned earlier?
Shayan: I wouldn't call it an advantage. There's nothing inherently better about the H800 architecture over, say, the H100's. They worked with what they had. They did really clever HPC co-design. They basically designed their architecture and their training regime all the way down to sub-CUDA manipulations of the streaming multiprocessors on the GPUs to get around some of these issues.
They cleverly overlapped communication between GPUs so that the bandwidth constraint was barely an issue they had to deal with. It wasn't so much that they leveraged H800s in a way that you couldn't also do on H100s. It's just that necessity drove them to do this giant deep dive all the way down to the hardware level to get around some of these limitations, so that they could actually train multi-100 billion parameter models on 14 trillion to 15 trillion tokens.
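As a rough illustration of the communication-and-computation overlap Shayan describes, here is a minimal PyTorch sketch. It is not DeepSeek's training code (which is not public); it assumes a distributed process group has already been initialized and simply shows the general pattern of launching collectives asynchronously and doing useful work before waiting on them.

```python
# Minimal sketch of overlapping communication with computation.
# Assumes torch.distributed has already been initialized (e.g. via init_process_group).
import torch
import torch.distributed as dist

def train_step(model, batch, prev_grads):
    # Launch the all-reduce of the previous step's gradients asynchronously...
    handles = [dist.all_reduce(g, async_op=True) for g in prev_grads]

    # ...and do useful compute (here, a forward pass) while those transfers are in flight.
    loss = model(batch).sum()

    # Only block on the network right before the reduced gradients are actually needed.
    for h in handles:
        h.wait()
    return loss
```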
Prem: Wonderful. Thank you. Looks like we're already on our way to busting a bunch of myths, so let's continue along those lines. A lot has been made about this 5.6 million number. That's apparently the number that was quoted that DeepSeek used to train their flagship R1 model. Is that necessarily true? I know, Prasanna, you mentioned that there are some misconceptions there, but maybe we can dive a little deeper in terms of firstly if it's true, and if it's not, what exactly is the truth there?
Prasanna: I think it's not true. That's the short answer. Shayan will go a little bit deeper into the V3 paper, but I'll just remind everybody that the model that DeepSeek made was built on top of Llama. Llama, in one of their papers, talks about taking 39-point-something million GPU hours, and then the V3 paper talks about 2.788 million GPU hours. Whereas Llama talks about it in terms of cumulative GPU consumption, the DeepSeek paper talks about it in the context of the last training run. There's a lot of nuance to unpack. I think Shayan did a pretty interesting deep dive into exactly what those numbers mean. You want to get into that?
Shayan: Just to be clear, the original DeepSeek LLM was designed on top of Llama and Qwen and that sort of thing, but since DeepSeek V2, which was a paper released in June of '24, they actually changed the architecture to a mixture-of-experts approach, which is a little bit more bespoke to what they were trying to do. They introduced DeepSeekMoE. DeepSeek V3, a paper released in December '24, is the paper that has the $5.6 million claim. The specific claim they made is that that number represents the cost of the last training run. The very last end-to-end run: assuming all iteration is done, assuming everything is perfect, they just hit the button and run it. They made some assumptions about how much each GPU hour might cost in the open market.
Practically speaking, obviously those costs are not what they incurred, because they actually purchased hardware and it's just sitting in a data center somewhere. Just to be clear, that number does not equate to how much it costs to train R1, which uses an identical architecture: they took DeepSeek V3 and did some different post-training. It's worth noting that a basic rule of thumb is that reinforcement learning costs about 8 to 10x the compute of supervised fine-tuning. You can back into some assumptions about how much it probably cost them to train R1-Zero, let alone R1, which had a much more sophisticated training regime.
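For reference, here is the back-of-the-envelope arithmetic behind the widely quoted figure, using the numbers mentioned above (2.788 million H800 GPU hours and the roughly $2-per-GPU-hour rental assumption from the V3 paper). The note about R1 reflects only the conversational rule of thumb, not any disclosed number.

```python
# Reconstructing the ~$5.6M figure from the V3 paper's own assumptions.
GPU_HOURS_FINAL_V3_RUN = 2_788_000   # H800 GPU hours for the final training run
ASSUMED_RATE_PER_GPU_HOUR = 2.0      # $/hour, the paper's market-rental assumption

final_run_cost = GPU_HOURS_FINAL_V3_RUN * ASSUMED_RATE_PER_GPU_HOUR
print(f"V3 final run: ~${final_run_cost / 1e6:.2f}M")  # roughly $5.58M

# Rule of thumb from the conversation: RL post-training can cost ~8-10x comparable
# supervised fine-tuning, so R1's true cost sits well above this headline number
# (exact figures were not disclosed).
```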
Prem: What you're telling me is that I can't just write someone a check for $5.6 million and go from zero to an R1-like model?
Shayan: Correct. Just to be super clear, in order for them to get to that level of efficiency, they had to have one of the world's most sophisticated teams. Not just from an AI research perspective, but all the way down to high-performance computing and the co-design of all the elements in between. Again, they had to drop down to the PTX level of optimizing how the streaming multiprocessors were used across their GPUs. They had to think about which experts in their mixture-of-experts architecture were co-located on which GPUs and get them as close as possible to the data. They had to think about how the data flow worked across devices, how to make sure that backprop didn't cause pipeline bubbles.
There were a whole bunch of really sophisticated things that they did, which were very specific to the model architecture that they designed. They had to co-design the whole thing. It's not like you can just pick up that work and then magically everyone's training is now way cheaper. This required an integration of the entire stack and very few teams on the planet are able to do that.
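To give a feel for the expert-placement question mentioned above, here is a toy Python sketch. It is a conceptual illustration only: the round-robin placement, token counts, and routing scores are made up and bear no relation to DeepSeek's actual MoE design.

```python
# Toy mixture-of-experts routing: which experts a token picks, and which GPU those
# experts live on, determines cross-device traffic and load balance.
import torch

num_experts, num_gpus, top_k, num_tokens = 8, 4, 2, 16
expert_to_gpu = {e: e % num_gpus for e in range(num_experts)}  # naive round-robin placement

router_logits = torch.randn(num_tokens, num_experts)        # a routing score per expert
chosen_experts = router_logits.topk(top_k, dim=-1).indices  # each token picks its top-k experts

# Count how many (token, expert) assignments land on each GPU: a crude proxy for the
# communication and load-balancing problem a real system has to optimize.
load_per_gpu = {g: 0 for g in range(num_gpus)}
for e in chosen_experts.flatten().tolist():
    load_per_gpu[expert_to_gpu[e]] += 1
print(load_per_gpu)
```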
Prasanna: Let's double-click on what that PTX thing is. It's essentially an assembly language underneath CUDA for operating on the GPUs. It's literally at the level of shift this byte left and all of that. To go from "Oh, I want to do this deep learning neural net thing" all the way down to the assembly level and map all of that together, not a lot of people can do that. Plus, the infrastructure to manage this large cluster and to be able to do all of these runs, the observability capabilities: all of that had to be pretty good in order to get to this level of sophistication.
Assume that you have a team, and now you say, "Here's $5 million." You still can't just do it because, well, they haven't open-sourced the training code or the optimizations that they have done. They've only open-sourced, as far as I can tell, the inference code for how to take the weights and run them at inference time. A git clone plus $5.6 million does not equal a model at the end of it, because there are a lot of things that are not open source.
Prem: Thank you. That was very helpful in terms of debunking that myth. Here is another one that seems to be doing the rounds. There are allegations that DeepSeek has used this process called distillation to train its R1 model using data from OpenAI's model, potentially violating their terms of service. Can you shed light on these claims, and can you explain what distillation actually is?
Shayan: Let me just first start with what distillation is. The whole idea is that you take a much bigger, much more sophisticated, much more capable model and you extract that capability out of it and shove it into a much smaller model. The way you might do that is you provide some set of prompts to the big model, you get it to respond, and then you use those prompt-plus-response pairs as training data for a much smaller model, which in theory then learns how to do the thing the bigger model was trying to do. You circumvent the really big, circuitous path that you normally have to take to train a really big, sophisticated model, and you can shortcut it and go for a smaller model.
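A minimal sketch of that distillation loop, with everything (the teacher call, the student fine-tuning function, the prompts) as placeholder parameters rather than any vendor's real API or DeepSeek's actual pipeline:

```python
# Distillation in its simplest form: collect (prompt, response) pairs from a big
# "teacher" model and use them as supervised training data for a smaller "student".
def build_distillation_dataset(teacher_generate, prompts):
    """teacher_generate: any function that sends a prompt to the big model and returns text."""
    return [{"prompt": p, "response": teacher_generate(p)} for p in prompts]

def distill(teacher_generate, student_finetune, prompts):
    dataset = build_distillation_dataset(teacher_generate, prompts)
    # The student learns to imitate the teacher via ordinary supervised fine-tuning.
    return student_finetune(dataset)
```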
The R1 paper talks about some amount of distillation. They don't explicitly talk about how they pulled data out of o1 to then train their own models. They do talk about how they distilled reasoning capabilities from R1 into much smaller architectures like Llama 7B, 70B, and so on. There's that. I can't really speak to whether they did break OpenAI's terms of service or not. What I will say is that at some level, OpenAI went around and grabbed a bunch of data from a bunch of different sources. They pulled all of that together and they trained a model.
Then they exposed access to that model via an API, where you can provide a prompt and get a response. Perhaps there's a line in their terms of service that says you cannot then take that output and train a model, in which case I'm going to let the lawyers figure out how exactly they want to go after this. From a technical perspective, it's not like the DeepSeek team somehow infiltrated OpenAI's infrastructure to extract weights and use those to bootstrap their model. If they did this, it was a lot more benign: probing the API, providing prompts from their dataset, getting responses, and then perhaps using those to train some elements of the model. Is it illegal? I've got no idea. Is it ethical? That's also a gray area. Does it work technically? The answer is yes.
Prasanna: I think the other aspect that is interesting here is that as the models get bigger and more complex, they are hungry for more and more data. There's only so much data to be had: now, with the Pile and a bunch of other open data sources, much of the internet is packaged and available as a dataset. Everybody's converging on more or less the same dataset to start with. Whether the output of one was used or not, the source data is starting to look similar. Therefore, what model builders are trying to do to differentiate is to start synthetically generating data to add more data volume.
As you get into that, you will be using some other model, whether it's an OpenAI model or Llama or some other statistical model that you created, to generate the next set of data that you want to use. One of the challenges in this whole thing is: at which point does that pollute every downstream model, by having essentially just the output of some other model being used in the next model, and so on? At which point is this process going to be meaningless, or are there techniques to generate statistically sound synthetic data in a way that actually helps the next model be better?
I think that whole area is technically interesting, but there is nothing written in the paper saying, yes, we used this data from OpenAI. They have not released their data. OpenAI hasn't released their data either.
Ken: I'm going to put you on the spot a little bit. You had a lot of open questions there. Any guesses? What are the answers? Can we do synthetic data in that way? Is that a June thing or a 2029 thing or--
Shayan: To be clear, synthetic data is actively being used in large-scale model training. That's without a doubt. In fact, a fair amount of the R1 paper is about the various techniques that were used to generate that data. In fact, it goes all the way back to the DeepSeek-V3 paper, where the R1 model, which hadn't even been released yet, was actually used to generate some of the data that DeepSeek-V3 was trained on. This is a well-known fact, even outside of DeepSeek. Synthetic data is being generated to create more such data that then can feed these models. It's also worth noting that not all data is made equal. Again, since we're talking about DeepSeek, I'll use the DeepSeek papers to illustrate this point, but this is not only a DeepSeek discovery. This is something that the entire AI industry has known for a while at this point.
In their very first paper, DeepSeek LLM, the point of that paper was to explore the scaling laws and get some sense of what the actual relationship is between parameter count and total tokens being used for training. What they discovered is in line with the industry, which is, it's not about the total token count, it's about the entropy being introduced by those tokens. How different are they? What density of training signal can be extracted?
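One common rule of thumb from the scaling-law literature (not specific to DeepSeek) ties those two quantities together: training compute is roughly six floating-point operations per parameter per training token. A tiny, hedged sketch of that relationship, with purely hypothetical model sizes:

```python
# Rough scaling-law relationship: training FLOPs ~= 6 * parameters * tokens.
def approx_training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# A hypothetical 100B-parameter model trained on ~14T tokens:
print(f"{approx_training_flops(100e9, 14e12):.2e} FLOPs")  # about 8.4e24
```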
The concrete example here is that if you had just a ton of really spammy, duplicated data, data that repeats itself over and over and over again, then even if you have 14 trillion of those tokens and you feed them into a model, the model is not going to learn that much, no matter how large that model is. Whereas in theory, you could have a much smaller dataset, let's say 1 trillion tokens, 4 trillion tokens, something like that, where it's a good mix of multilingual data, coding, math, whatever.
It's all different, well-curated and so on. You take that, and even though it's a smaller token count, there's a denser training signal that can be extracted, and therefore it justifies a larger model architecture. You could use that to feed a fairly sophisticated multi-100 billion parameter model in spite of the fact that it's a lower number of tokens. It's worth noting that there is that relationship. Yes, synthetic data is used, but it's in the spirit of: how do you generate more high-entropy data? The fact is that we've exhausted the high-entropy sources of data on the internet. Now it's just this exercise of filling in the gaps.
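A crude way to see that "entropy" point is to measure how much of a corpus is just repetition. The distinct-n-gram ratio below is a common rough diversity check, not DeepSeek's actual data-curation method; the two toy corpora are invented for illustration.

```python
# Heavily duplicated data carries far less training signal per token than a diverse mix.
def distinct_ngram_ratio(tokens, n=3):
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

spammy = ["buy", "now", "cheap"] * 1000        # repetitive, "spammy" corpus
diverse = [f"tok{i}" for i in range(3000)]     # every token distinct
print(distinct_ngram_ratio(spammy))   # near 0: almost no new signal per token
print(distinct_ngram_ratio(diverse))  # near 1: each token adds information
```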
Prem: It sounds like, from what we know of course, they might not have done anything blatantly illegal, at least. Most model trainers seem to do something similar in terms of drawing on data sources that are out there, and DeepSeek appears to have done the same.
Shayan: Without commenting about the legality of it, because I think legality is one thing, technical feasibility is another thing, and how technically common is this? I would say that they did something that's technically common. The legality of it is a nuance that just goes back to, what signs did they have to look at and then choose to disregard in order to do this technically common thing?
Prem: What jurisdiction [crosstalk] different jurisdiction.
Shayan: It gets really complicated, especially since we're crossing country boundaries here as well. What is enforceable, what is not? That's the practical case for legality. Without commenting too much on the legality side, they didn't do anything that hasn't already been done, is the point.
Prasanna: Exactly. I guess OpenAI or Meta or any of these other companies may have done the same thing, we just don't know.
Prem: Okay. Let's move on to the next thing that seems pretty important. DeepSeek now offers R1 as an open-source model, obviously, and they have also priced their API access at about $2 per million tokens. If you compare that with OpenAI's o1, which is approximately $60 per million tokens, that's a staggering 97% lower.
Although, while we are talking about this, OpenAI has just released their o3-mini, which is a lot lower in price. The bigger question that I have is: while all of this sounds really great for accessibility, does this pricing model signal a significant shift in AI economics, and what could the long-term impact of something like this be?
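For the arithmetic behind that comparison, here is a quick check using the round figures quoted in the conversation; actual list prices differ for input versus output tokens and change frequently.

```python
# Sanity-checking the "97% lower" figure from the round numbers quoted above.
deepseek_r1_per_million = 2.0    # $ per million tokens, as quoted in the conversation
openai_o1_per_million = 60.0     # $ per million tokens, as quoted in the conversation

reduction = (openai_o1_per_million - deepseek_r1_per_million) / openai_o1_per_million
print(f"~{reduction:.0%} lower")  # about 97%
```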
Shayan: I want to disentangle a couple of things from that. Let's actually start from the bottom and work up, and then we'll come back down. At the bottom: does this signal a shift in inference economics? The answer is probably not. The reason I say that is because the DeepSeek series of papers, if you trace the arc, is all about a singular focus on building ever larger language models with minimal cost, memory overhead, training instability, all that stuff. They designed all of this in the spirit of efficiency. That lends itself to a lower inference cost being seen by the model vendor, DeepSeek in this case, and therefore they can pass that cost saving down to the user.
Now, is it on the same order as what it's being priced at? Who knows. Practically speaking, there are a lot of reasons why a company might decide to price lower than their costs. Predatory pricing is a thing, especially in highly contentious industries. We don't actually have a good sense of what the real costs are in running this model. That's point number one. Point number two is that while the DeepSeek paper talks about o1 as their primary point of comparison, practically speaking, the models are not exactly apples to apples. They compete on certain benchmarks, so coding is one of them, reasoning is broadly another, but even then, within reasoning, frankly, R1 can't do certain things. It's not good at general knowledge recall, as an example.
It's not good at a lot of stuff that o1 theoretically is also good at. That may come back to parameter count, that may come back to FLOPs per token for inference, for instance, which then backs into the cost of vending these models. Practically speaking, it's not really a good comparison. The other thing that I'll say is that at the top of all of this, we talked about how sophisticated this team needed to be in order to eke out the performance gains that they saw. Again, they had to go all the way down to a very, very deep level on the hardware and co-design the whole stack.
That work potentially signals a shift in the way AI research teams might operate, and theoretically could shift the fundamental economics. It's not a drop-in, as in everyone's training and inference is now cheaper. It detailed a certain way of working and a certain team structure that might be able to achieve that. Long story short, I don't think this signals any massive shifts. If anything, we're seeing two fiercely competing companies trying to outprice and out-innovate one another. That's good for the market overall. I don't think it signals anything with staying power beyond that.
Prasanna: Yes. I think even if you look at what DeepSeek claims about their model, they basically say that they've paid attention to English and Chinese. They really haven't paid attention to the larger set of languages in the world. There will be other aspects. Essentially, there's going to be more and more diversity in how people use these models, so different models will be useful for different reasons. I think from a technical perspective, digging into the optimizations, and how they've done optimization from an inference perspective, is very interesting, but it's going to be a while before those particular optimizations become widely available for anybody and everybody to adopt. We also keep calling it open-source. It's not open-source, it's open-weight: the model weights are open, but the source is not. They haven't open-sourced the assembly code that they used to optimize for the interconnect bandwidth that they have; that's not available. It's not open-source in a way that other people can piggyback on top of. They've published the paper in the open to say this is what we did, but they haven't shown the code for how they did it.
It's going to take some time for somebody to read the paper, figure out what they did, and then write and open-source that code. Some of these optimizations will make their way out eventually, but I agree with Shayan that it's not necessarily a fundamental shift at that scale.
Prem: The reality is that they do only charge $2 per million tokens, whereas OpenAI's prices are much higher. Are they just trying to play the market? Are they trying to slash prices in a manner that is not sustainable, or do they have something real here? That really was the question, but it's a pretty significant reduction in cost.
Prasanna: If you remember, even a few months ago we were talking about how OpenAI itself is pricing so low, so far below their actual cost, that they are also subsidizing the market. I think DeepSeek is probably following the same playbook, but by how much is unclear. We don't know exactly how much cost they actually incur. The labor cost is probably lower, if I were to assume, but this team is highly skilled, so I doubt it's going to be much lower on the labor side of things.
On infrastructure, the H800s that they have access to are cheaper than H100s, but only nominally cheaper, because due to supply constraints, the actual price to buy an H800 in China right now is significantly higher than the price to buy an H100, even though the original list price is a little bit lower. I think these are commercial decisions more than technical decisions, so I'm not sure we have any deep insight into how they decided that.
Ken: You talked briefly there about open-source versus open-weight and so forth. With DeepSeek, just in general: when I think of open source, I think of something that you can download, you can build yourself, you can run on your own hardware, et cetera. With these types of things, even if truly everything were out there in the open, isn't just the ability to run it a barrier to entry? Are these truly open in the sense that people can contribute?
Shayan: Can you run it? The answer is yes. Can you reproduce their papers? The answer is no. I think that's the distinction between open-source and open-weights in the strictest sense. They basically released an asset, and this is not specific to DeepSeek; even Meta does this. Llama is an open-source model where you can see the architecture, you can see all of that stuff, but they don't provide you the data that it was trained on. They don't give you the training scripts that were used to train it. None of that is readily available.
Practically speaking, you get an asset that you can then run on your own and you can do stuff on top of it, but the underlying actual mechanics, meaning the data it was trained on, the specific training regime that was used to train it and so on, is unavailable to us. DeepSeek followed that exact same playbook.
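As a concrete illustration of what "open weights" gives you in practice, here is a sketch of downloading a checkpoint and running inference with the Hugging Face transformers library. The model ID and generation settings are examples to check against the actual repositories and their licenses; none of this reproduces the training itself.

```python
# You can download and run the weights, but the data and training code stay private.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # example distilled checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain mixture-of-experts in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```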
Prasanna: Even if you're trying to download Llama off of Hugging Face, you need to accept an agreement. You need to agree to some terms from Meta. It's not copyleft, where everything is available to everybody; it's not that kind of thing. It's just that you can download the weights if you agree to the terms.
Prem: Isn't it less restrictive than even Llama, for example? Right now, they have released it under an MIT license, which basically means that I can take the model and use it for pretty much any purpose I see fit, because it's an MIT license, whereas Llama, for example, has a fairly restrictive license where I cannot use it for anything commercial. Beyond something research-based, I can't really use it.
Prasanna: Yes, but it's still just the weights. They haven't released the data that this was trained on. They haven't released the code that was used to train it. They haven't released the code for all the optimizations that they did. Even if you had $5.6 million to spend and you hired the best team, you still are not starting on day one. You'd have to go figure out where the data is; you'd have a couple of years of work to do to replicate all of that.
Prem: Thank you very much, Prasanna and Shayan. That was really, really enlightening, if you ask me. Lots of myths busted for us and our listeners. We come to the end of this episode with a lot of insights, and maybe a couple of months from now we might come back and record a part two of this conversation to see exactly where we've ended up, because even now it's all fairly new as we absorb what's going on in this space. Thanks a lot, and I look forward to having yet another conversation on this topic very soon.
Shayan: Thanks so much for having us.
Prasanna: Sounds good. Thank you so much. See you next time.