For data scientists, drudgery is still job #1

Data cleaning and preparation still eat up nearly half of a data scientist's workload

The hassles of data intake and cleaning, problems with biased models and data privacy, and gaps in experience and technical skills all ranked among the biggest challenges facing data scientists and software engineers in data science disciplines, according to a newly released survey.

Anaconda, maker of the Python distribution of the same name for scientific computing applications, conducted its 2020 State of Data Science survey with 2,360 respondents from 100 countries, slightly fewer than half of them from the U.S.

Despite all the advances in recent years in data science work environments, data drudgery remains a major part of the data scientist’s workday.

According to self-reported estimates by the respondents, data loading and cleaning took up 19 per cent and 26 per cent of their time, respectively. Together that is 45 per cent, almost half of the total. Model selection, training/scoring, and deployment took up about 34 per cent in total (around 11 per cent for each of those tasks individually).

When it came to moving data science work into production, the biggest overall obstacle—for data scientists, developers, and sysadmins alike—was meeting IT security standards for their organisation.

At least some of that is in line with the difficulty of deploying any new app at scale, but the lifecycles for machine learning and data science apps pose their own challenges, like keeping multiple open source application stacks patched against vulnerabilities.

Another issue cited by the respondents was the gap between skills taught in institutions and the skills needed in enterprise settings. Most universities offer classes in statistics, machine learning theory, and Python programming, and most students load up on such courses.

But enterprises find themselves most in need of data management skills that are taught only rarely or not at all, and advanced math skills that students don’t often develop.

Students themselves felt that a lack of experience (40 per cent) and a lack of technical skills (26 per cent) were the biggest barriers to jobs in the field, shortcomings that (according to Anaconda) could be better addressed by strong internship programs that “go beyond providing a résumé enhancement and hands-on-keyboard technical skills.”

One finding in the report shouldn’t surprise anyone: Python remains king of the languages used in the data science space. R comes in a distant second, while JavaScript, Java, C/C++, and C# trail behind.

Julia, a rising contender in the data science world, wasn’t listed in the running, but it’s unclear whether that was because it didn’t figure into enough respondents’ answers or because the survey didn’t mention it.
