Apache Spark is widely accepted as a powerful and convenient replacement for MapReduce. It is also much more accessible, in large part because it can be used from Python.

In March 2016 I gave a presentation at the Montreal Python meetup to show how Apache Spark can be used for simple data processing applications. In many cases it might be better to use frameworks we already know well, like scikit-learn, but sometimes the alternatives have something interesting to offer. Spark still builds upon the Hadoop foundation and gives easy access to big data technologies such as Hive and HDFS.

How it all started

On March 2nd, 2016, Mathieu, the organizer of Montreal Python, asked me to follow up on a promise I had made to present at the next user group meetup:

19:51 <mlhamel> so, still good for the talk? any ideas for it?
19:51 <mlhamel> title and topic?
19:56 <deuxpi> yes, it's taking shape :)
19:56 <deuxpi> The plan is to do a bit of music analysis with PySpark

Two weeks before the event, I didn’t have any presentation material ready, but it seemed possible to put together something original without taking too many risks. The first idea was to build a new presentation from pieces of code and data that I knew well and that had been sitting on my laptop for a long time. The challenge would be to put them all together and hope that they made a coherent and entertaining talk.

I started by dusting off an old homework project from university. This program used the GTZAN genre dataset and produced a simple visualization of how well a computer can recognize music genres. The project stopped at presenting a graph and documenting the whole process. Support libraries such as scikit-learn would only be released two full years after this code was written, so pretty much everything had to be built from raw numpy functions.

Visualization of genres

The GTZAN dataset is well known. Researchers have used and reused it as the starting point for musical genre analysis, and it has become something of a benchmark. It’s one of the few easily obtainable datasets with audio samples and almost naive genre labeling. Thierry Bertin-Mahieux has written about this dataset and hints at the limitations caused by its simplistic tagging. You can find several articles, written from the early 2000s until recently, that use the GTZAN dataset.

Modernizing and riding the hype with Spark

Analyzing audio samples and processing them for classification (part of a domain called Music Information Retrieval) is no longer a very active research problem. The methods and algorithms are probably still in use in popular services such as Spotify and Google Play Music, but research seems to have moved on to the social and cultural aspects of music, with the goal of providing better personalized playlists.

The slides from this talk are available, as well as the complete code and data.