The Changing Data Science And Data Engineering Tooling Environment

(This post is based on Jesse Anderson's work on data teams.) As AI becomes a focus for a growing number of enterprises, these organizations are realizing how important it is to have the right people and skills in place. In particular, demand for data scientists has risen sharply as AI, various applications of machine learning (ML), non-ML predictive analytics, and other so-called “big data” approaches continue to gain traction in the enterprise. That demand has led to the talent crunch we’re seeing across many enterprises and organizations. However, given the commonly cited estimate that 80% of an AI project has to do with data preparation and data engineering activities, perhaps organizations should be searching for data engineers even more than data scientists?

Companies are competing for increasingly scarce data scientists. Salaries and signing bonuses for skilled data scientists continue to skyrocket, and the sheer number of code academies now focusing on data science is evidence of the demand for data science skills. However, are data scientists always what these organizations need? Many enterprises, vendors, and startups confuse the roles of data scientists and data engineers. While these roles share some traits and skills, at their core they are job descriptions with two very different skill sets that are not easily interchangeable.

Data Scientists vs Data Engineers

In the mid-2000s, we saw the emergence of the Data Scientist position. As an O’Reilly article put it: “This increase in the demand for data scientists has been driven by the success of the major Internet companies. Google, Facebook, LinkedIn, and Amazon have all made their marks by using data creatively: not just warehousing data, but turning it into something of value.” Not surprisingly, any organization that has data of value is looking to data science and data scientists to extract more value from that information.

With roots in statistical modeling and data analysis, data scientists have backgrounds in advanced math and statistics, advanced analytics, and increasingly machine learning and AI. The focus of data scientists is, unsurprisingly, data science: how to extract useful information from a sea of data, and how to translate business and scientific informational needs into the language of information and math. Data scientists need to be masters of the statistics, probability, mathematics, and algorithms that help glean useful insights from huge piles of information. They have usually learned programming out of necessity more than anything else, in order to run programs and advanced analyses on data. As a result, the code that data scientists are tasked to write is usually minimal, only what is necessary to accomplish a data science task (R is a common language for them to use). Data scientists work best when they are provided clean data to run advanced analytics on. A data scientist is a scientist who creates a hypothesis, runs tests and analyses on the data, and then translates the results so that someone else in the organization can easily view and understand them.

On the other hand, data scientists can’t perform their jobs without access to large volumes of clean data. Extracting, cleaning, and moving data is not really the role of a data scientist, but rather that of a data engineer. Data engineers have programming and technology expertise, and have previously been involved with data integration, middleware, analytics, business data portals, and extract-transform-load (ETL) operations. The data engineer’s center of gravity is big data and distributed systems, along with experience in programming languages such as Java, Python, and Scala, and in scripting tools and techniques. Data engineers are challenged with taking data from a wide range of systems, in structured and unstructured formats, that is usually not “clean”: it has missing fields, mismatched data types, and other data-related issues. These data engineers need to use their programming, integration, architecture, and systems skills to clean all the data and put it into a format and system that data scientists can then use to analyze, build their data models, and provide value to the organization. In this way, a data engineer is an engineer who designs, builds, and arranges data.
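To make the cleaning work concrete, here is a minimal sketch in Python of the kind of task described above: taking records with missing fields and mismatched data types and putting them into one consistent shape. The record layout, the field names (customer_id, signup_date, monthly_spend), and the helper functions are all hypothetical, invented for illustration. Real pipelines deal with far larger volumes and many more edge cases.

```python
from typing import Optional

# Hypothetical raw records, as they might arrive from different source systems.
# Note the mixed types, the empty string, the missing field, and the missing key.
RAW_RECORDS = [
    {"customer_id": "101", "signup_date": "2019-03-01", "monthly_spend": "42.50"},
    {"customer_id": 102, "signup_date": "", "monthly_spend": 17},
    {"customer_id": "103", "monthly_spend": "n/a"},
    {"customer_id": None, "signup_date": "2019-04-15", "monthly_spend": "8.00"},
]


def to_float(value) -> Optional[float]:
    """Coerce a value to float, returning None when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean_record(raw: dict) -> Optional[dict]:
    """Normalize one record; return None if it has no usable customer_id."""
    try:
        customer_id = int(raw.get("customer_id"))
    except (TypeError, ValueError):
        return None  # assumption: a record without a usable ID is dropped
    return {
        "customer_id": customer_id,
        "signup_date": raw.get("signup_date") or None,  # empty or absent -> None
        "monthly_spend": to_float(raw.get("monthly_spend")),
    }


def clean_all(records: list) -> list:
    """Clean every record and keep only those that survived."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]


CLEANED = clean_all(RAW_RECORDS)
```

Even in this toy version, most of the code is about handling bad input rather than analysis, which is the point: this is engineering work, and it comes before any data science can happen.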

Can there be a combined Data Scientist-Engineer role?

While it might seem that the roles of a data scientist and data engineer are distinct, data scientists and data engineers share many traits and skillsets. These overlapping skills include the necessity to work with and manipulate big data sets, programming skills to apply operations to the data, data analytics skills, and general fluency with systems operations. 

While the overlap between these roles is substantial, the emphasis of each role is still distinct, and as a result, they’re not particularly interchangeable. Even more importantly, when interviewing and hiring data scientists and data engineers, you need to make sure you’re asking the right questions and seeking the right skills from your candidates. Are you asking your data scientists to spend most of their time on data engineering tasks? Are you demanding more data science capabilities than your data engineers have the experience, training, aptitude, or even desire to provide? Are you confusing your job candidates by asking them engineering questions in an interview for a data science position, or data science questions in an interview for what’s fundamentally a data engineering job?

More importantly, the rise of data science code academies, workshops, and training raises the question: are these training and code academy activities focused on the science behind data science, or on the engineering and programming behind what is fundamentally data engineering? Or worse, are these activities muddying the waters by mixing a bit of engineering with data science, without adequately screening their attendees to determine which area of big data and ML analysis each individual should focus on?

While it might seem that you can do a bit of engineering in a science role, or a bit of science in an engineering role, mixing the roles could be very detrimental to your organization’s success with an ML or data science initiative. Data scientists who are pushed into engineering roles without the background, skills, or aptitude can easily misconfigure or misuse technology, or write programs that are inefficient, costly, and a waste of time. Likewise, asking individuals with a fundamentally engineering background to learn the complicated mathematics of data science can lead an organization to incorrect conclusions about its information, with disastrous outcomes. Specialization is important: this is why doctors perform checkups and phlebotomists draw blood. It’s possible for the doctor to draw your blood and the phlebotomist to understand the lab results, but why would you want to risk your comfort and health this way?

Where does the Data Scientist fit in your organization?

Most organizations need both data science and data engineering roles if they are trying to address problems that require data science solutions. Since these roles are not interchangeable, it is a mistake to go looking for the single, magical data scientist-engineer unicorn. Yet, while you might need multiple data scientists and engineers in your organization, the ratio between the two is rarely 1:1. For most organizations, it makes sense to have more data engineers than data scientists. The reason is that data scientists have learned to operate with large volumes of clean data, but getting lots of clean data from many disparate systems can amount to multiple full-time jobs. It simply takes more work to move and clean data than it does to conceptualize data models and run analyses against the data sets.

Also, organizations often get the reporting structure for the data scientist wrong. Frequently, data scientists report to the technical team. However, this doesn’t make sense. The data scientist isn’t (usually) asking technology-specific or implementation-specific questions. Often the challenges the data scientist is facing are line-of-business specific. As such, the data scientist should report to the strategic decision-making parts of the business that represent the specific lines of business that the data scientist is assisting.

If Data Scientists are business-centric roles, will we see business-centric tools for Data Scientists?

If data science and engineering are truly separate roles in the organization, then it makes sense to think of the tools they need as separate as well. Many vendors entering the data science and machine learning world are muddying the waters and making things even more confusing. They say their tools are for data scientists, but everything about those tools is primarily for data engineers, with a sprinkling of data science on top. This doesn’t make sense. The natural environment for a data scientist is an analytic, data-oriented, model-centric tool, not something that has big buttons for cleaning data, moving data, and moving stuff from a private environment to the cloud. This is like giving a driving instructor the pieces of a car and saying “build this car yourself, and then teach others how to drive it.”

Rather than engineering and programming-centric tools, data scientists need data science-centric tools. Right now there’s a growing collection of these tools, often emerging from data or predictive analytics environments, that suit the needs of data scientists. However, it’s possible that even more business-centric tools might be appropriate, especially as data scientists become more embedded with the line of business. For example, decades ago, operating on large volumes of data in a spreadsheet-like format involved programming. Then tools like Excel introduced features like pivot tables, and now business managers are able to perform all sorts of analyses. It’s only a matter of time before tools like Excel embed data science capabilities, or business-centric data mining and analysis tools, into their products.

As the talent gap for data scientists continues to widen, there is no doubt that we will see new tools created out of necessity to allow non-technical (read: business) people to run, test, and analyze data. Strategic business managers will begin to learn data science, without needing or wanting programming or data integration experience. Traditional data scientists will still be needed to run very complex analyses of data. For the most part, however, basic analysis will move to the business unit thanks to increasingly easy-to-use tools. This means we have yet to see which tool or technology will be the dominant one for ML and data science in the enterprise.

Read more on this subject in Jesse Anderson's Data Engineering Teams book.
