Brief summary
Computational notebooks — such as Jupyter and Databricks — have soared in popularity with data scientists thanks to the ease with which text, visualizations and code can be combined in a living document. But what works for the data scientist doesn't always fit developers' needs, and productionizing notebooks is fraught with perils. Our podcast team explores how to use computational notebooks most effectively.
The two articles mentioned in the discussion can be found here:
Don't put data science notebooks into production
Coding habits for data scientists
Podcast transcript
Neal Ford:
Hello, welcome everyone to the Thoughtworks Technology Podcast. I'm one of your regular hosts, Neal Ford, and I'm joined today by another of our regular hosts, Zhamak.
Zhamak Dehghani:
Hi Neal.
Neal Ford:
And we are here to talk today with two Davids about computational notebooks. To try to keep this straight in people's minds, we're going to refer to one of them as Dave — that's David Tan. So let's hear his voice.
David Tan:
Hey everyone.
Neal Ford:
That's who you identify as Dave. And we also have David Johnston.
David Johnston:
Hello everybody.
Neal Ford:
So I'll let David give us his background at Thoughtworks, and then Dave can give his background, and we'll talk a bit about computational notebooks. So David?
David Johnston:
All right. Thank you, Neal. So I'm a data scientist at Thoughtworks. I've been here around eight years now. In my first career, I was a cosmologist, and I guess I first started working with notebooks when I was in college — they were Mathematica notebooks. And since then I've used notebooks a little bit, the Jupyter notebooks of the modern era. I wrote an article recently on the Martin Fowler blog called "Don't Put Data Science Notebooks into Production," so perhaps you can see where I'm going with some of my arguments from the title. But yeah, I am happy to speak to you guys today. Thanks.
Neal Ford:
Okay. And Dave?
David Tan:
Hey everyone, I'm Dave — David Tan. You can call me Dave. I'm a developer at Thoughtworks. I joined about four years ago, and I moved from Singapore to Melbourne this March, just before COVID blew up. At Thoughtworks I'm fortunate to sit in the sweet spot between data science and software engineering. I've written some articles and talked about coding habits for data scientists — how we can take solved problems from the software engineering world and apply them to the problems and the pain people are facing in the data science world. It started as a workshop that became an article and then a video series about coding habits for data scientists. I sense there is a demand in this space; people generally say, "Oh, I wish people would watch this video." So I'm happy to be here to share how we can bring these solved problems into the data world.
Neal Ford:
Right. And as you've guessed, what we're talking about today are computational notebooks. These have become very popular tools in the data science world, and in other parts of the software development ecosystem. For those who are not familiar, the original idea, as far as I know, came from the famous computer scientist Donald Knuth, who came up with literate programming back in the early 80's: the idea that it's a shame you can't actually read source code. So he invented a platform — I know of at least CWEB and the original WEB for Pascal — tools that would let you write prose and code intermingled together.
Neal Ford:
You'd run it through one processor and it produced documentation, and through another that produced executable code. That of course grew into Mathematica and Jupyter, and I'll let one of the other, more knowledgeable people take up the history of this style, because it has become popular in the data science world for obvious reasons: the ability to intermingle documentation with the ability to execute something inline, get results right away, and, for example, play with parameters. Did I summarize that correctly?
David Johnston:
Yeah, I think so. I will add a few things. A notebook starts out where you have a terminal, right? We always had terminals, and you would run a command in the terminal. You could run a program and it would print out some text at the end, right? So that's the first stage. The next stage is you might want to see that text formatted in a nice way — like a table. And there are ways to make a table show up in text so that you can actually read it. But then there's graphics. When you're doing data science or exploration of data, you want to make things like plots. And originally, when you made a plot, it would be saved to a file, and then you could open the file in some kind of file viewer.
David Johnston:
And then there are windowing systems — with Matplotlib, for example, you can make a plot and pop up a window in the operating system's windowing system. And then when web programming became a big thing, everyone wanted to work in a browser, and we wanted our plots to show up in the browser. There are libraries, for example in Python, where you can get plots to show up in browser tabs. But in the notebook, everything shows up in one place. You have one window where you can type the code in. You can see the textual output. If you have a table — a pandas DataFrame, for example — you can print it out and it will be formatted nicely in the window that you're in, which is in the browser.
David Johnston:
Which means it looks like a table that you see on the web. It has lines on it and headings and everything; it looks nice. And finally, you can make a plot and the plot will show up right there, right after the command that you wrote, so that you have a linear sequence of the code that you run, the output formatted in a nice way, and the plots showing up in the browser, such that you can scroll up and down and see all the results. It's a simplification over having your files and your graphics show up in different places and having to pull them back together. Everything happens in one place, in one tool.
Zhamak Dehghani:
I think Martin wrote about this, and he used the phrase "illustrative programming": in environments like Excel, you write a little piece of code and an illustration of the execution of that program on data, as an example, can be shown immediately. You get the feedback, which seems to be quite powerful when you are exploring or learning. So I'm curious to see how you're using this tool in your workflow as a data scientist, and what flavors of that you've seen in terms of how people are using notebooks.
David Johnston:
The other thing about notebooks is that there are cells, right? So if you ran a command and want to rerun it, you can push a button and it will rerun that cell, and you can rerun the next set of cells if you want as well. Whereas previously, in a terminal, you'd hit the up arrow, up arrow, up arrow until you got to your command, run that command, and then do that again to run the next command. It's a much more visual way of working. And let's say you run some command to do machine learning and you get a plot to see a visualization of how well it did. Okay, well now you want a different parameter. You want to run it with beta equals four instead of beta equals two. So you can go up to that cell, change beta from two to four, hit the rerun button, and it will rerun that step of the machine learning, and you can rerun whatever steps you want after that and get the new graphic right away.
David Johnston:
It can just be an easier, more fluid way of working where you don't have to redo all the steps you did before, which you would if you had a single script. If you just ran a script at the command line, it would run all 20 steps, and if you wanted to make any change, you'd have to rerun the entire script, whereas the ability to rerun cells lets you change just the part you wanted to change. That's a big gain, especially if those first 20 steps took half an hour to run: you don't have to rerun them, and you don't have to write code that caches the results to files and brings them back in. It all just happens right there in the notebook itself.
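A minimal sketch of the cell-based workflow described above, in Python. The cell boundaries, the synthetic data and the ridge-style beta parameter are illustrative assumptions, not code from a real project; the point is simply that when beta changes, only the cheap downstream cells need rerunning.

# --- Cell 1: expensive step, run once; its results stay in the kernel's memory ---
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))                  # imagine this prep took half an hour
true_coeffs = np.array([1.0, 2.0, 0.5, 0.0, -1.0])
y = X @ true_coeffs + rng.normal(scale=0.1, size=100_000)

# --- Cell 2: the parameter you keep tweaking; edit this and re-run just this cell ---
beta = 4.0                                         # was 2.0; change it and hit "run cell"

# --- Cell 3: cheap downstream step; re-run it to see the new result immediately ---
coeffs = np.linalg.solve(X.T @ X + beta * np.eye(5), X.T @ y)   # ridge-style fit
print(coeffs)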
Zhamak Dehghani:
And yourself — how do you utilize notebooks in your day-to-day job?
David Tan:
For my personal projects, usually when I'm given a data set and I want to do something with it, I start a Jupyter notebook. Firstly because it gives fast visual feedback: as David mentioned, you can see the plots, you can validate some of your ideas really quickly. And once that proof of concept is done — say I've loaded my data, done some data cleaning, trained a simple model — I've got a model.
David Tan:
So I've validated that it's possible to train a model with these parameters. We can now put the notebook aside and start to write it properly, with tests and modules and functions, things like that. So that's one way I use it. Another way I use notebooks — and in fact it's what helped bring me into the data science world — is through self-learning, through following along with tutorials. Some may disagree; some people say that Jupyter notebooks are really confusing for beginners. My experience is that they actually helped me.
David Tan:
I can take a notebook from Kaggle, where I don't know what it's doing, hit "run all" and start seeing: okay, these are the plots, this is how they do data cleaning, this is how they do feature engineering. That kind of self-learning, code-along, visual feedback is really useful for teaching — not just independent learning, but also when I'm teaching a group: having a platform where you don't need to worry about operating systems or Python runtimes helps beginners just focus on the programming language itself. So I think of the Jupyter notebook as a tool, and like any other tool — a knife, say — you can use it to carve a turkey or you can use it to hurt somebody. Jupyter notebooks can be used in a good way or a bad way, and this is my experience with the good parts.
David Johnston:
I think the best thing about them is really for demos, like you were saying. It's almost like a self-documenting workflow, right? Once you've done what you've done, you can save that as a file and give it to someone else. And you don't have to write a presentation about all the steps that you did. The document itself is self-documenting, it makes really good demos. And I think that's the real strength of it.
Zhamak Dehghani:
I've been playing with this idea of using them, as you said, as documentation — but documentation of the underlying data. In places where we really treat data as a product, as a reusable product for other people to come and use — architectures like data mesh — there is always this desire to make the data more discoverable and understandable. What are the semantics underneath it? What are the statistical characteristics of the data? We're exploring computational notebooks as a way of documenting the data itself and telling a data story for someone who comes across the dataset. I'm actually curious what you guys think about that; it's a bit of an experiment for us right now.
David Johnston:
Yeah. I think one of the things there is that with data science, when you do some step, there may not be any very easy way to test it, right? But when you print out the results or make a graph or some visualization, it gives you more confidence that what you did was correct. And that's a double-edged sword, because it tends to mean people use visualization and these more qualitative means for testing things, which distracts them from writing real unit tests. But it's better than not testing at all, right? If you just write a script, run it, get some result at the end and say, "Okay, well, I guess it's correct. It created a file and therefore it must be correct" — compared with that, it's better to see the individual steps, print stuff out, and get feedback along the way. It certainly gives you more information about whether or not you might have done things right or wrong, but it's not a replacement for unit tests.
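A minimal sketch of the kind of unit test being contrasted with visual checks here, for a hypothetical cleaning step — the column name and the rules are invented for illustration.

import pandas as pd

def clean_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing ages and clip implausible values into [0, 120]."""
    out = df.dropna(subset=["age"]).copy()
    out["age"] = out["age"].clip(lower=0, upper=120)
    return out

def test_clean_ages_drops_missing_and_clips_outliers():
    raw = pd.DataFrame({"age": [25, None, -3, 400]})
    cleaned = clean_ages(raw)
    assert len(cleaned) == 3                      # the missing row is gone
    assert cleaned["age"].between(0, 120).all()   # outliers are clipped, not silently kept

Unlike eyeballing a plot, a test like this runs on every change and fails loudly when the behavior regresses.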
Neal Ford:
Well, it sounds like it's a really intense feedback loop, because as you're exploring things you want the fastest possible feedback, and it sounds like this basically wires you up in an environment that gives you the fastest possible feedback as you tweak values and things in your model.
David Johnston:
Just doing graphics in general can give you answers to questions that you didn't even ask, in a sense. You make a histogram of the results and you realize that everything has been classified as class one — okay, well, that's not a very good algorithm. And you see that through a visualization like that: you expected a nice bell curve of results, you see it all piled up in one bin, and you realize, "Oh, something's wrong there."
David Tan:
And about feedback loops — I feel that's the strength of notebooks, and also their main weakness. I'm using notebooks, and a lot of people are using notebooks, because of that fast visual feedback. If you run it and it throws a stack trace, you know you've got to fix something. If you see that all of your categories are in bin one, you know something's wrong with the data. In my article I drew a chart: imagine an x-axis which is lines of code, and a y-axis which is the time it takes to get feedback that everything is still working. With notebooks you get a big spike of productivity at the start, and then it tapers off really quickly, because to know that everything is still working you've got to restart and run the entire notebook, look at some table, make sure the number 98.1 didn't regress to 95, something like that.
David Tan:
And along the way you might see some error, and then you've got to fix some things, so the feedback cycle becomes exponentially long. Whereas on the same chart, if we did the software engineering style of writing modules and having tests, the feedback grows linearly — actually, as your x-axis grows and you've got more and more lines of code, your time to get feedback is, yeah, maybe constant? Yeah, it's constant, right? Just run all the tests; all your hundred tests pass and within seconds you know everything is still working, instead of the notebook style where you have to restart and run all. So feedback loops are the pro, and at the same time the con, of notebooks.
Neal Ford:
Well, I mean, this would not be the first time that we've gotten in trouble by taking something that is a massive interactive convenience, and then trying to move it into a more robust production-like environment. So both of you I think have strong opinions about trying to take computational notebooks and put them into production environments. So let's hear some of the downside then of trying to take this idea of interactivity too far.
David Johnston:
One way of looking at it is that notebooks are very similar to spreadsheets, and they have a lot of the same benefits and weaknesses. Spreadsheets allow people who don't know a lot about programming to take data, do some transformations, do some calculations that are important to them. You can also make visualizations — you can make plots in Excel. But notebooks have the same problems, which have to do with scaling. Spreadsheets are good for simple things, but they're not good for very complex things. We don't use spreadsheets to run the payroll at a major bank, for example. Even though you could, we don't, and one of the reasons is that spreadsheets are hard to test — and notebooks can also be hard to test, for the same reason.
David Johnston:
With spreadsheets, we generally test through visualization. We have a calculation that takes the first three columns, adds them together, and we drag that down through all the rows. Then we look at it and say, "Oh yeah, we had one, two, and three; they should add up to six." That did add up to six. I looked at the first row, so I assume the rest are all correct — I'm not going to look at them all — and I use that kind of visual feedback to say this probably works. That can work for simple things, but for very complex things — if you have a spreadsheet with hundreds of tables in it and seven different tabs — will it still work? How do you know it works? It's just hard to test in the normal way.
David Tan:
Yeah, another challenge with notebooks — another place where they fall down — is the difficulty of modularizing them. You can write functions, but with most notebooks the code is just there; you don't put it in another module, you don't hide the complexity. So to understand the what of the code, you're forced to read the how. Imagine you go to a restaurant, you open up the menu, and the first item reads: put some oil in a pot, add some garlic, do this and that, turn up the heat, turn down the heat, let it simmer. The implementation detail is really overwhelming. The menu could have just said: this is onion soup. There, done. You know it's onion soup, you're not touching the onion soup now, you put it aside. I want to work on my thing.
David Tan:
So the second challenge I feel, in addition to the testing, is modularization. And as a tool, it makes me pick up these bad habits of not modularizing. It's so easy to just write code and get it to work — and once it's working, I'm going to have a beer, just walk away, and forget to come back and modularize things. So that's another pain point with notebooks.
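A minimal sketch of that "onion soup" idea: the same hypothetical notebook code, first inline, then extracted into a named function so the notebook cell states the what and hides the how. The column names are invented for illustration.

import pandas as pd

# Inline notebook style: to understand *what* this cell does, you must read *how*:
#   df["age"] = df["age"].fillna(df["age"].median())
#   df["fare"] = df["fare"].clip(upper=df["fare"].quantile(0.99))
#   df = pd.get_dummies(df, columns=["sex", "embarked"])

def prepare_passenger_features(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing ages, cap fare outliers, one-hot encode the categoricals."""
    df = df.copy()
    df["age"] = df["age"].fillna(df["age"].median())
    df["fare"] = df["fare"].clip(upper=df["fare"].quantile(0.99))
    return pd.get_dummies(df, columns=["sex", "embarked"])

# The notebook cell now reads like a menu item, and the function can be tested on its own:
#   features = prepare_passenger_features(raw_df)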
David Johnston:
On testing things, one of the major problems is that you're only visualizing the result on the data you ran it on. Generally speaking, we use notebooks to develop, say, models for machine learning, where we're working off a static file. We have a CSV file, we write the code, we put it in the notebook, we make all the charts, it looks good, the validation looks good. Then if we hand that off to someone to turn into a production application, there's going to be new data coming in, and it might not look good. With the new data there may be edge cases we didn't have before, and they may break things. And if you have to go back and manually, visually inspect everything to get a feel for how well it's working, that doesn't scale, because now it requires a human to go in — manual intervention.
David Johnston:
And you want your testing to be automated, like I said. You want to have a unit test that just runs and checks for all the edge cases. And if you do get something new in the data, you want the monitoring to catch it and say, "We haven't seen this before. Stop." Instead of just running the model and getting some kind of garbage output.
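A minimal sketch of the kind of monitoring guard described here: fail loudly on data the model has never seen, instead of silently producing garbage output. The category names are purely illustrative.

KNOWN_PAYMENT_TYPES = {"card", "transfer", "cash"}   # what the model was trained on

def validate_payment_types(values: list[str]) -> None:
    """Stop the pipeline if incoming data contains categories we have never seen."""
    unseen = set(values) - KNOWN_PAYMENT_TYPES
    if unseen:
        raise ValueError(f"Unseen payment types in incoming data: {sorted(unseen)}")

validate_payment_types(["card", "transfer"])     # passes silently
# validate_payment_types(["card", "crypto"])     # would raise and halt the run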
Zhamak Dehghani:
I'm assuming that the missing pieces or missing capabilities in notebooks — the ones that lead to them not scaling and not being suitable for production use — are common characteristics across different types of notebooks. It's not a problem of one particular notebook, the Jupyter notebooks or the Databricks notebooks; it's a common characteristic. Is that a correct understanding?
David Johnston:
Yeah, I think so. The thing to realize about a notebook is that it's just a script, essentially. At the core, it's a script: it runs a sequence of commands, and it has the visualization built around that. But at the core it's still a script, and therefore it has the same problem as thinking of scripts as the only way of writing code. You can write a 40-line script and it can be fine, it can work. But you don't write a 900-line script — some people do, and they run into problems — and you definitely don't write a 9-million-line script. You have to break code into modular pieces so that you can abstract what they do. You have a class over here called fit line to data: it takes in X and Y as lists and produces A and B, which are the coefficients of a line.
David Johnston:
And as Dave said, you don't have to know how it does that. Maybe you wrote that code, maybe you didn't, but when you're making use of it, you don't want to have to think about how it works — you just know it fits a line to the data, and now you can make use of that. So if you have a lot of modular pieces, and you know each of those pieces works because you have a test, or a sequence of tests, for them, and then you combine them together, then you know the thing you've combined should also work, because it's just running a sequence of those steps, and each of those steps has been tested to work. So then you know the whole thing works.
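A minimal sketch of that fit-line-to-data piece, written here as a plain function rather than a class, together with the kind of test that lets the rest of the team trust it without re-reading the implementation.

from typing import Sequence, Tuple

def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a*x + b; returns the coefficients (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    b = mean_y - a * mean_x
    return a, b

def test_fit_line_recovers_known_slope_and_intercept():
    x = [0.0, 1.0, 2.0, 3.0]
    y = [1.0, 3.0, 5.0, 7.0]          # exactly y = 2x + 1
    a, b = fit_line(x, y)
    assert abs(a - 2.0) < 1e-9
    assert abs(b - 1.0) < 1e-9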
David Johnston:
But if you just write a 900-line script — which a lot of people do, whether it's a notebook or just a script — it can be difficult to debug. You don't know that it works, and if it doesn't work, figuring out why can be tricky. You have this giant namespace, you might reuse a variable name, and that might create some really crazy output that takes a long time to figure out. When the software development world learned how to write better software, it was about learning these things — how to control complexity through modularity and those kinds of practices. You just can't do that in a script.
David Tan:
In terms of capability, another thing I miss in notebooks is developer productivity — autocompletion, IntelliSense. It's a little bit tactical and a little bit low-level, but when you're programming, you want to come to your work and be productive. You have an idea in your mind; you just want to type and let the IDE help you. In the software engineering world, if you're programming in Kotlin, Java, Python, whatever, usually the IDE has some tooling to say: these are the parameters you can pass into this function, this is the documentation of that function. In software engineering this is kind of a solved problem. With notebooks, I find myself always going to duckduckgo.com to search for things like, "What is this API? What arguments does it take?" — having to jump in and out of it. So that's a bit of a gap in capability. I've seen some improvements in Colab and Jupyter — I think there's some configuration you can make — but out of the box, without that, it's a challenge to write code productively as well.
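One small way to narrow that gap, sketched here with an invented function: type hints and a docstring give notebook completers and IDEs something to show as you type the call, instead of sending you back to the search engine for the argument list.

from typing import Tuple
import pandas as pd

def split_by_date(
    df: pd.DataFrame,
    date_column: str,
    cutoff: str,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split rows into (before, after) using an ISO date cutoff, e.g. '2020-01-01'."""
    before = df[df[date_column] < cutoff]
    after = df[df[date_column] >= cutoff]
    return before, after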
Zhamak Dehghani:
Yeah, and I think that's evident in the path we've been on. Computational notebooks started as a way of documenting your own process — your future self coming back and reading what you've done — and perhaps sharing it and exploring. Now we're moving towards, "Okay, we need to mobilize a larger number of data scientists. They seem to be comfortable with this tool. Let's go into the mode of mass production," and get all of these data scientists to contribute to the code that goes into production. But because we haven't gone through that software engineering discipline, the tooling hasn't caught up, or the environment hasn't caught up, to really treat this as a complex piece of software and build in that modularity. So maybe this is the point where we say, "Oh, this is a diminishing return. This is not the right environment in which to try to build complex, long-standing, productionized code," and rather than investing more in that, let's invest somewhere else to mobilize that large population of data scientists. I think this whole conversation about warning people against productionizing notebooks started with companies like Netflix popularizing the idea and building a whole set of frameworks around it, and now we see platform providers like Databricks giving you a path: "Okay, this is your notebook, this is the CI/CD pipeline to get your notebook to production." That has been appealing to a segment of data scientists, so I'm curious: if notebooks are not the right medium to create this long-standing, resilient, testable, maintainable code, then what is? What is the path from that exploration to production?
David Johnston:
Yeah, I think that's the main thing, right? When you start on a data science project, it's often the case that you don't know if it'll ever go to production, because you don't know if it's going to work. It starts out as a rough idea: "Maybe we can build a model to predict this," and, "Maybe this will be useful, and if it were, we could plug it into the application and do a lot of interesting things." So you start off with the exploratory phase, where you're looking at the data, you're trying different models, you're trying different features, and at some point you reach the point where you think, "Hey, this is actually going to work. This is actually going to create value." And notebooks, and also a lot of these tools — pandas, for example, and a lot of the tools available in Python — have been optimized to get you to that point as fast as possible, right? To the point where you say, "Hey, these things are actually going to work."
David Johnston:
But if you think about actually bringing anything to production, that's the first 10% of the work. The next 90% is building the application around that thing: making sure it works, building security, building monitoring, building the UI — the way the customer actually makes use of it. So being productive for that first 10% is good, but if the tool then gets in your way so that you're not productive for the rest of the project, it's not that helpful as a tool for the whole workflow. It's a good tool for the beginning, exploratory phase, but notebooks encourage you to keep working in that way, continuing to monitor whether or not things work in a visual, manual way. Let's say you have a notebook, and you push it off to someone, and they put it in production, and then you want to continue to work on it. So you just take your notebook and copy it to a new file, where you've duplicated all the code, and then you make more changes to it.
David Johnston:
Well, what if they find a bug in one part of the file that you handed off, and they fix that bug? It's not going to magically get fixed in the copy you made, because you're not reusing code — you're duplicating code. And as software developers know, duplicating code is always a bad thing; it's a risk for exactly that reason. When something is bad and it gets fixed, it doesn't magically get fixed in the places where it has been duplicated. So you run into a lot of cases where you're debugging things, running into bugs, troubleshooting. And three months in, once you've handed off version one and version two of the models, you find the data scientist spending all their time debugging and troubleshooting, as opposed to doing what they're actually good at — the data science skills they have, creating models. A lot of that is because they're still relying on tools that were good for the exploratory phase in what should have become the production phase of the project.
David Tan:
Yeah. That reminds me of an article I read by Kent Beck called Partitioning Complexity. One of the main techniques to help developers or data scientists be productive is to partition the complexity, right? I think that's why libraries like scikit-learn are so popular: all of the implementation detail is gone, you just need to know the API, and you can focus on your data cleaning and feature engineering — the small slice of complexity you're faced with. What you mentioned just now reminds me that Jupyter notebooks can be like glorified manual testing. There are beautiful charts, and it's so easy to be sold by that, but at the end of the day it's manual testing, as David said. So we want to de-risk the work of data scientists and the whole team, move away from manual testing, and replace it with automated tests, as you say.
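A minimal sketch of that partitioned complexity, using scikit-learn's public API on synthetic data (the dataset and model choice are arbitrary): all of the fitting mathematics stays behind fit and score, so your code only deals with the slice it owns — the data and the parameters.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "your data"; in practice this is the part you own.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)              # the implementation detail lives behind this call
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")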
David Tan:
Yeah, I think in the end it boils down to scalability and to the safety of the team. As the data scientist who proved this concept and ran this code, you know it's now going to be evolved upon — it's going to get new data, new features — and you want the team to have an easy way of testing changes to it. That comes through what David described: through automated testing and through modularization to partition the complexity, so that when you want to change one little thing, you don't need to take on the whole model and the whole data pipeline and feature engineering. You want to partition complexity to make life sane, really.
Zhamak Dehghani:
So in your head, is there a clear transition? Like if I'm a data scientist and then I'm exploring and visually testing, and maybe it's okay for now, but then I'm getting more serious and gaining more confidence in the model that I've built, and I want to move it forward towards production, then where is that transition point that I have to move away from this tool to something else? If such a thing exists?
David Johnston:
So I think the important thing to understand is that, as a data scientist, you have to think of yourself as a software developer. You can't say, "I'm not a software developer. I never will be. I don't want to deal with that. I just want to write models and hand them off." That just doesn't work very well, because the developers do need to write code which they know works, right? If you hand off code which does a lot of transformations, they don't know if it works. They're going to have to refactor that code, break it into smaller pieces, and actually figure out how it works and show that it works. The machine learning part itself — the model — they don't have to understand how that works. That can be a black box; that's fine, that's the domain of the data scientist. But there could be bugs along the entire process, the whole pipeline of transformations and feature generation, and they need to know that code works. It has to be put in production, it's going to run on real data, and it could create real problems if there are bugs there.
David Johnston:
So what needs to happen is that, as a data scientist, you work with the developers and learn a lot of the skills they have learned. They have learned to write code which will work, which they can trust, and you need to learn those skills; there's really no other way around it. So when you're working with a team of software developers, you should be trying to learn to write code yourself that is modular and can be tested, so that you can automate the tests as well. Then when you push that code off to them, they might want to make some changes and improve it a bit, but they are reusing that same code, which has been modularized, and building on top of it.
David Johnston:
You can still use a notebook at that top level to make use of that code, but you shouldn't be creating these giant scripts that are not modular and not testable, because they're going to have to be transformed by the time they get to production, so you might as well learn what they are doing and do what they do, so that there's not a stage where bad code has to be poured into good code, and then that just creates a barrier between the two groups. The two groups really need to learn to share more skills, learn from each other. The data scientists learn what developers do well and bring in those skills and vice versa. The developers need to learn some more about how data science works, and the two working together should be sharing those skills and growing their skill sets.
Neal Ford:
It sounds like it's very important that data scientists view this as a prototyping tool, because I can imagine that you'll get pushback from management, saying, "Well, you've already done this work. Why are you redoing it to put it in production?" You've got to explain that, no, this is really prototyping, and for real volumes of data, for real certainty, and for all those engineering concerns, you really need to "redo" it — I'm using air quotes there that people can't see on the podcast — to make it suitable for a production-like environment.
David Tan:
Yeah, I had the same impression. About Zhamak's question of when we transition: I see notebooks as a proof of concept. In UI work, we can have lo-fi UI — you can take pen and paper, sketch some boxes in a mobile-app-shaped square, and validate the idea by bringing it to users: "Would you click this?", "What would you do after that?" Notebooks, to me, are the same thing. I think two years ago I saw some software where you can basically draw some boxes on your computer and it will generate the code for you. My spidey senses went off: it's going to be messy behind the scenes. It will look nice on the front, but it's going to be hard to maintain and hard to extend.
David Tan:
Notebooks, to me, are the same. You want to validate, as David said. You want to fail fast; you want to prove the idea quickly. So the moment we've proved it's possible, there's no point investing any more code or effort in this sketch, this messy code base, so we put it aside and start writing modules and tests. I think there are two approaches to this. You can start from scratch, re-implementing everything. Or you can say, "Okay, I've got this notebook, I'll use nbconvert or whatever to convert it to a Python file," and then go through a refactoring cycle.
David Tan:
In software engineering, when you deal with a legacy code base, it's usually very scary to change it — you don't know what's going to break. So one of the first things you want to do is add a characterization test to say: when I run this script from start to end, as David mentioned, what is the visible artifact? Is it a model? Is it an accuracy score of 98.1%? Is it a plot? Whatever it is, you write a characterization test to pin that down, so it runs automatically with every code change, and then slowly you start breaking off chunks into small functions that you can TDD — use test-driven development — to implement.
David Tan:
You might, along the way, find bugs that you might fix in this process, but yeah, the smaller the notebook is at the start, the smaller problem it is to solve, so when to make the switch? I would say it's as early as possible.
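A minimal sketch of that characterization-test idea. The legacy_pipeline function below is an inline stand-in for the converted notebook-turned-script; in practice you would import the real module and pin whatever visible artifact it produces (a score, a file, a plot's underlying numbers).

import random

def legacy_pipeline(seed: int = 0) -> float:
    """Stand-in for the converted notebook: 'trains' something and returns an accuracy."""
    random.seed(seed)
    return 0.981 + random.uniform(-0.0005, 0.0005)

def test_pipeline_accuracy_has_not_regressed():
    # Pin the behaviour observed before refactoring started; change this threshold
    # deliberately, never by accident.
    assert legacy_pipeline(seed=0) >= 0.98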
Zhamak Dehghani:
It looks like there is an emotional element there as well. People start with this notebook and it becomes their whole world — it encapsulates what they've put into it and the feedback they've got — but there is a point where, okay, this has done its job. It's throwaway, as prototypes are, and we have to move to an environment that lends itself to a long-lived artifact. There is this thing called the IKEA effect: you put a piece of IKEA furniture together, and it looks pretty ugly, but you put it together yourself, so you're not going to throw it away. It seems like we have to get over that IKEA effect with these notebooks, which have a shorter lifespan.
Neal Ford:
So David was talking about a collaboration between data scientists and developers and other engineers within a project. How do you facilitate that? How do you make that possible?
David Johnston:
Well, I'd say you start by saying, "Okay, you're one team, and you have to deliver this product to production." If it's not in production, it's not creating any value for the company. Just delivering a model to a team who can't do anything with it hasn't created any real value. So you build teams around delivering value to production, which means everyone on that team is responsible for that entire process. If the data scientist creates a model and just hands it over the wall to a team who don't really know what to do with it — or maybe they put it into production but it breaks, there are bugs, and it does the wrong thing because the data changes or something — you're as responsible for that problem, that failure, as anyone.
David Johnston:
Now, if you are a data scientist and you say, "I don't want to learn how to program. It's not really my thing. Someone else can do that," you're actually going to spend a lot of time programming, because there are going to be a lot of problems, a lot of bugs, and you'll be troubleshooting all the time — which means you'll be spending all your time doing the thing you hate the most, which is programming. If you learn how to program well, you'll write code that works, and it will continue to work. You'll actually spend less time doing the thing you didn't like to do — again, programming, going through code and troubleshooting — and more time doing what you want to do, which is to work on models, think about data, and where the information is.
David Johnston:
So it's ironic in that sense, because I know myself, I used to be a scientist. And a lot of scientists, they don't really like to code. They just do it because they have to do it, to do what they want to do. And I remember all the time I spent, before I really knew how to program, just troubleshooting, and how much I hated it. I cursed the language and everything. But it was just because I didn't really know how to write code well, that was testable, modular, could be rearranged easily, refactored. Once I learned how to do that, I actually spent less time doing all those things, doing the part of programming, at least, that's not fun at all. Because the code works. And if it's a bug that I put in, I find that right away, because the test fails, and I fix it. And then I move on.
David Tan:
About facilitating collaboration: David, you wrote an article about not productionizing notebooks, and one of the things that really caught my eye was how it symbolizes a deeper problem about collaboration. We are productionizing notebooks because teams are not collaborating — I wrote a notebook, I throw it over the wall, somebody else will productionize it. Whereas a better operating model, as you described, is cross-functional teams where data scientists learn from developers and developers learn from data scientists. I've seen it work on my previous project, and that tight feedback loop and capability uplift were really valuable.
David Tan:
And a second point I wanted to make is about bridging this gap. Of course, everybody wants to be productive, wants to deploy awesome things into production. So when I wrote that coding habits for data scientists article, I shared it, and by chance, by accident, it blew up a little bit in the data science community. A lot of people, especially people in the PhD community, shared it and said, "This is the pain I'm feeling; this is what we should be doing."
David Tan:
So my sense is there's a desire within the data science community to do this; the question is, how do we do it right? I think all four of us have the benefit of being at Thoughtworks — we work with really brilliant, smart people who do test-driven development and refactoring and all the good things. And there are resources out there about these agile practices: continuous delivery, unit testing, all of these good things. So in the show notes we'll share some of these links, and they're hooks to start exploring this world where data and software come together, and to share solutions to problems that have already been solved in the software engineering world.
David Tan:
And so, yeah, I just found it quite interesting that in the data science community there's this demand, and it's really a "show me the way, what should I do next" kind of problem.
Zhamak Dehghani:
Yeah, I love that — it's quite funny that you end up doing the thing you hate the most. But I really like the idea of both ends of the spectrum: pure software engineers getting closer to understanding data science, and data scientists getting closer to understanding how programming works, because that's the future. There's no way we can run away from it. The future is becoming more digital, data-driven, intelligently augmented, so there is no escape from it.
Zhamak Dehghani:
And I think there is another element to this as well, which is the element of platforms. Platforms and technology and tooling will elevate the abstraction and hide the complexity of the low-level work that a lot of us are feeling and dealing with. You mentioned this earlier — people quite like these tools because they hide away the complexity and give you some APIs to work with. As that evolves, that's the thing that helps these cross-functional teams work more closely. I always wonder whether the new connective roles that we create and label are the right thing. The new role of ML engineer — someone who now connects the data scientists and the programmers and sits in the middle — is that really the right thing to do, as opposed to, well, everyone becoming somewhat of an ML engineer, because these are the tools they need to know and the skills they need to have?
Neal Ford:
I think that's true. As we necessarily become more specialized — because the things we have to solve become more specialized — there's still a bit of generalization that needs to creep in there, to create some baseline of consistent knowledge about engineering practices. Some of the things David was talking about, around testability, and some of the things Dave was talking about, around modularity — those are both very important concepts that go beyond just data science and creep into all the other aspects of software development, because at the end of the day, it's all software.
David Johnston:
Yeah. The other thing is that it's very easy for a data scientist to become a bottleneck on a project, because so much of the work is inside their script, their notebook, so whenever anything needs to change or there's a bug anywhere, they need to fix it because it's in their wheelhouse. The more code you can move into the other parts of the code base — which are maybe more straight-up software: transformations of the data, the prep of the data, or even just the construction of the visualizations — the more you can get the whole team involved, and the specialist stops being the bottleneck. If there's a bug in the visualization, it's nice if you can say, "Oh, the devs can fix that," rather than, "Oh, the data scientist has to fix that because it's in the notebook part of the code." You really want the whole team to be able to work on it as much as possible, such that the specialized parts are as small as possible.
Neal Ford:
So that was great. Thank you, David and Dave — David Johnston and David Tan. It's a great example of one of the things we try to do at Thoughtworks: taking new capabilities like computational notebooks and figuring out ways to apply good engineering practices to them. It sounds like both of you have taken that journey and produced some really good output for people to look at, and I encourage our listeners to go dig deeper into this. It's a fascinating subject area and one that I think is going to continue growing as time goes by. So thank you, David and David.
Zhamak Dehghani:
Yeah, thanks, David. Thanks, Dave.