Agile and Lean techniques seem to be the best way we currently know to create complex software in the face of risk, uncertainty, and changing requirements. Agile hinges on embracing and adapting to change by enabling rapid feedback cycles and evolutionary development. However, bringing agility into big data (and small data) analytics has been a challenge for many very bright and talented data scientists and engineers. In this article we’ll explore what makes analytics different from application development, and how to adapt agile principles and practices to the nuances of analytics. We’ll also examine how the disciplines of data science and software development complement one another, and how they intersect in an agile project environment.
The data scientist, the software developer, and the data engineer
First, let’s look at what differentiates analytics experts from software developers. C.F. Jeff Wu first introduced the term “data science” in 1998 as a discipline that encompasses statistical analysis, science, and advanced computing. The use of analytics by social media companies like LinkedIn, Facebook, and others in recent years has boosted the popularity of the “data scientist” role so much that Harvard Business Review published an October 2012 article entitled “Data Scientist: The Sexiest Job of the 21st Century.” Simply put, a data scientist has a unique and very deep blend of the skills depicted in Figure 1.
Figure 1: The Disciplines of Data Science, Source: Calvin Andrus, Wikipedia
Data science skills both complement and overlap with software development skills. Data science requires programming, but data scientists are often not trained in modern software engineering practices. Conversely, many developers have skills in data engineering, advanced computing, and statistics, but these are not commonly their areas of deep expertise. Data scientists commonly code in multi-paradigm languages like R and Python, which have powerful statistics libraries and an active research community behind them.
Data engineering is the bridge between data science and software development. A data engineer supports the data scientist in data discovery, harvesting, and preparation, and supports developers in operationalizing analytical models for production deployment, which we will discuss shortly. This role requires expertise in data management technologies (“big data”, NoSQL, and SQL), data modeling, data architectures, and data manipulation languages and techniques.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.