2014 In Review: Five Data Science Trends

2014 was another transformative, exciting year for data science. Summarizing even one measly year of progress is difficult! Here, in no particular order, are five trends that caught my attention. They are are likely to continue in 2015.

Adoption of higher productivity analytics programming environments. Traditional languages and environments such as C, C++, and SAS, are diminishing in importance as R, Python, and Scala ascend. It is not that data scientists are dumping the old stuff; it is that a flood of new data scientists have entered the fray, overwhelmingly choosing more modern environments. These newer systems provide language conveniences as well as a rich library of built-in (or easy to install) libraries that handle higher abstraction analytics-related tasks. Modern data scientists don’t want to write CSV read routines, JSON parsers, SQL INSERTs or logging systems. R is notable in the sense that its productivity gains come from its packages and community support, not from the language itself. R’s clunkiness will be its downfall as Python, Scala, and other languages gain greater traction within the analytics community and narrow the libraries gap.

Machine learning is king. Data science, however you define it, is a broad discipline, but the thunder belongs to machine learning. Rather, machine learning is becoming synonymous with data science. Optimization, for example, is increasingly being described as a backend algorithm supporting prediction or classification. Objective functions are described as “learning functions”, and so on. We are collectively inventing definitions and classifications as we go along, so in some sense this is merely semantics. There is a real problem however: we risk ignoring the collective wisdom of past masters, throwing rather broad but shallow techniques at models with rather particular structure. Somewhere in a coffee shop right now, somebody is using a genetic algorithm to solve a network model.

Visualization is everywhere. To the point that it’s starting to get annoying. Whether in solutions such as Tableau or TIBCO Spotfire, add-ons such as Excel Power View or Frontline’s XLMiner Visualization app (plug!), or programming libraries such as matplotlib or d3.js, the machinery to build good visualizations is maturing. The explosion of infographics in popular media have raised expectations: users expect visualizations to guide and inform them at all stages of the analytics lifecycle. As you have no doubt seen, the problem is that it is so easy to build shitty viz: bar charts with no content; furious heat maps signifying nothing. We’ll start to see broader, if unarticulated, consensus on appropriate visualizations and visual metaphors for the results of quantitative analysis. I hope.

Spark is supplanting Hadoop. Apache Spark wins on performance, has well-designed streaming, text analytics, machine learning, and data access libraries, and has huge community momentum. This was all true at the beginning of 2014, but now at the end of 2014 we are starting to see a breakthrough in industry investment. Hadoop isn’t going anywhere, but in 2015 many new, “big time” big data projects will be built on Spark. The more flexible graph-based pipeline at the heart of Spark is begging for great data scientists to exploit – what will 2015 bring?

Service oriented architectures are coming to analytics. The ubiquity of REST-based endpoints in web development, combined with a new culture of code sharing, have engendered a new “mixtape” development paradigm. A kid (or even an older guy…) can whip out their MacBook, create a webservice on django, deploy it on AWS using Elastic Beanstalk, connect it to interactive visualizations in an afternoon, and submit an app that night. Amazon has built a $1B+ side business on the strength of cloud-only services, and same these forces will drive analytics forward. The prime mover in data science is not big data. It is cloud. RESTful, service-based analytics services will explode.