More than a decade ago, what is now commonly known as the Big Data era began with the emergence of Hadoop. Since then, a multitude of technologies has been introduced to fulfill different tasks within the Hadoop ecosystem, with capabilities ranging from in-memory processing to presenting unstructured data as relational tables, and with dashboards becoming the de facto standard for making sense of the mind-boggling amounts of data produced daily.
At first, these technologies checked all the boxes for building a reliable, production-ready data architecture capable of processing multiple streams of data. But as companies kept delving deeper into their data, the flows grew too complex to manage. Applying ETL patterns by merely swapping out the technologies doing the work wasn’t a sustainable solution: applied on clusters of thousands of nodes, these patterns rapidly lost their efficiency and became unmanageable.
Spark jobs scheduled through cron, a mechanism that at first allowed companies to gain valuable insights, rapidly proved to be merely a first step on a very long road.
To organize their ever-growing data architectures, companies invested in tools that would allow them to manage their data pipelines efficiently. Technologies like Airflow and Luigi were introduced by some of the companies that struggled the most with overly complex pipelines (Airbnb and Spotify, respectively).
These workflow-management systems gave Big Data enthusiasts some breathing room to restructure their workflows and to allow their recently introduced data architectures to scale properly.
With Airflow having established itself as the go-to data-flow management platform, it feels like time to tackle the next issue within the ever-changing Big Data ecosystem: scripts.
Why are Big Data scripts a potential problem?
The current norm in Big Data scripting is to write the script in the desired language, whether it’s Scala, Python, or even Java, and schedule it through a scheduler to process the data at a suitable frequency. This mechanism presents three major problems:
- Debugging becomes a hurdle: Even if your scripts are neatly structured, finding a proper fix when something goes wrong can get very complicated, mainly because you need to rerun the script after every change you make, and when you’re working with Big Data, rerunning even a single query can mean hours of waiting. Workflow-management systems are still very limited when it comes to error reporting, so when something goes wrong, you’re on your own.
- Collaboration between different roles is nearly impossible to achieve: Data is one of the fields where the lines between different roles are still somewhat blurred. Processing data through scripts stored in dedicated files means that a certain number of technologies are used solely for this purpose within the data architecture. A data scientist and a data engineer may find themselves working with completely different technologies, yet both rely on the same datasets to carry out their roles; an ideal architecture would therefore minimize the number of technologies involved, allowing people with different skill sets to collaborate on the data pipelines.
- Processing and exploration are done separately: A problem that surfaced after the first wave of Big Data implementations is the consequence of separating functionalities within Big Data architectures. Relying on one set of technologies solely for processing data and on others to visualize and explore it might have been practical ten years ago, but in an era in which we talk about Fast Data rather than Big Data, this separation no longer suits most corporations. Waiting for the bulk of your pipelines to execute before being able to explore your data means that you’re always one step behind. Accessing the processed data in real time, as the pipeline is gradually executed, instead allows for faster, data-driven decision-making.
With these three problems in mind, it becomes apparent that relying on scripts as the core of your Big Data architecture is a mechanism that will only get harder to manage with time.
Why are notebooks the answer to this problem?
Notebooks, an ambitious concept that started back in 2001 with the release of IPython, were until recently considered a niche product used mainly by data scientists to explore small datasets.
A notebook is simply a JSON document containing text, code, media output, and metadata. The powerful protocol behind it, together with the multiple metadata fields it offers and its impressive modules, allows it to do wonders with such documents.
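To make this concrete, here is a sketch of what such a JSON document looks like: a minimal, hand-written nbformat 4 structure with a single code cell (real notebooks generated by Jupyter or nteract carry richer metadata):

```python
import json

# A minimal notebook document (nbformat 4): one code cell, no output yet.
minimal_nb = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3"}
    },
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,   # not executed yet
            "metadata": {},
            "outputs": [],             # filled in once the kernel runs the cell
            "source": ["print('hello, notebooks')"],
        }
    ],
}

# The on-disk .ipynb file is just this structure serialized as JSON.
print(json.dumps(minimal_nb, indent=2))
```

Everything a notebook tool does, from rendering markdown cells to re-running code, boils down to reading and rewriting documents shaped like this.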
Thanks to the interactivity they offer, notebooks rapidly gained popularity within the data science community, and Project Jupyter was introduced in 2014 to huge success. The project was built on top of the IPython modules and introduced a set of new functionalities.
By offering kernels in multiple programming languages, Jupyter democratized notebooks and gave the concept the necessary push it needed to start penetrating fields other than data science.
So how do notebooks solve the three problems discussed above?
- Cells make debugging easier than ever: One of the core elements notebooks rely on is cells. A cell is a portion of the notebook that may contain code (code cells) or text (markdown cells); you can consider it the notebook’s main building block. Through cells, you can divide your scripts into a number of code portions that are run separately while still sharing the same kernel and data. This capability allows for much easier debugging and an elegantly structured script.
- Collaboration is a given: The issue related to collaboration gets automatically solved when switching to notebooks. By offering interactivity, direct access to data, and multiple kernels for different languages, notebooks effortlessly place themselves at the center of the company’s data architecture, and allow users with different roles to use them in order to respond to a vast range of needs.
- Exploration and visualization are immediately available: Unlike scripts, notebooks allow direct access to data through code cells. These cells can be executed progressively as the jobs run within the data pipelines, allowing immediate access to production data. Additionally, visualizations can be rendered directly within the notebook in multiple ways, which makes it possible to explore data without switching tools.
Thanks to these characteristics, notebooks can place themselves as an impressive alternative to old-school scripting when it comes to Big Data, and thanks to new notebook-based tools that started to emerge during the past few years, this is merely the beginning of the Notebook era.
The pioneers of the notebook era
As mentioned above, IPython was for over a decade the only notebook-based system available, and it suffered for a long period from multiple issues (notably its inability to properly reload kernels). Jupyter solved most of these issues but at first focused mostly on functionalities needed mainly by data scientists, without expanding the range of use cases for notebooks.
Notebooks’ first dive into Big Data came in 2014 with Apache Zeppelin, a multi-purpose notebook that offered interpreters allowing it to be used seamlessly with the most prominent Big Data technologies (notably Apache Spark).
It then took only two years for yet another breakthrough. Thanks to a Jupyter community that never ceased to expand, tools and modules were continuously introduced within its ecosystem, and this stream of powerful new functionalities culminated in the release of nteract in 2016. With the vast set of packages it relies on and its composability, nteract rapidly turned into a notebook-centric ecosystem of its own, with tools like Papermill and Commuter offering functionalities like parameterizing notebooks and enabling immediate collaboration.
The bold choice made by the Netflix Data Platform team in 2017, discussed in two extensive posts on the Netflix Technology Blog, to switch to a notebook-based architecture shows how powerful notebooks can be when accompanied by the right tools and implemented within a suitable ecosystem. And as one of the major players of the Big Data era, Netflix proves that notebook-based architectures are here to stay.
Zeppelin and nteract might be taking different roads, but that only proves how vast the range of use cases is when it comes to notebooks, and that there has never been a better time to give interactivity a try!
Try the magic for yourself!
To give you a taste of how simple and efficient scripting through notebooks can be, let’s create our own data architecture and enjoy the perks notebooks have to offer.
For this example, we’ll be using Quandl’s WIKI Prices dataset, so let’s start by creating a free Quandl account to receive an API key. This dataset is no longer being updated by Quandl, but we can use last year’s data. Assuming that Quandl updates the database daily by midnight, we’ll build a simple pipeline that does the following:
- Retrieves the past day’s data from Quandl
- Rearranges it in a form that suits us
- Visualizes it accordingly
To do that through scripts, in an old-fashioned Big Data pipeline, we’d need to use a separate tool for the visualizations, and we’d have to monitor our pipeline and our output through logs and tests.
With notebooks, everything is centralized in the same tool, and the whole pipeline can be assembled into a single notebook.
We’ll be using Apache Airflow to manage our pipeline, nteract to create the notebook, and Papermill to parameterize it.
Preparing the environment
First let’s start by installing the Quandl Python package by simply running:
pip install quandl
Installing nteract is also very straightforward: simply download the version corresponding to your OS. Then we’ll add the Python kernel through the following commands:
python -m pip install ipykernel virtualenv
python -m ipykernel install
Creating the notebook
After setting up the environment, we can start the fun part of the project. Let’s start nteract, switch to the Python kernel, and use the first cell to import the different modules we’ll need:
To use nteract’s data explorer, we need to enable the following pandas mode:
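The mode in question is pandas’ Table Schema output; emitting it alongside DataFrames is what lets nteract pick them up in its interactive Data Explorer:

```python
import pandas as pd

# Emit the Table Schema representation of DataFrames so that nteract
# renders them with its built-in Data Explorer instead of plain HTML.
pd.options.display.html.table_schema = True
```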
Next, let’s retrieve yesterday’s closing prices for all the companies in the Quandl database (note that the year is fixed to 2017 because Quandl no longer updates this database):
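A cell along these lines does the job (the API key is a placeholder, and the pinned date is arbitrary; in a live dataset you would compute it from today’s date instead):

```python
import quandl
from datetime import date

quandl.ApiConfig.api_key = "YOUR_API_KEY"  # placeholder: use your own key

# The WIKI dataset stopped updating in 2018, so we pin a 2017 date here
# rather than computing date.today() - timedelta(days=1).
yesterday = date(2017, 12, 28)

# Pull every company's row for that day; qopts narrows the returned
# columns, and paginate=True fetches the full result set.
prices = quandl.get_table(
    "WIKI/PRICES",
    date=yesterday.isoformat(),
    qopts={"columns": ["ticker", "date", "close"]},
    paginate=True,
)
```

The call returns a pandas DataFrame, so `prices` can be reshaped and displayed directly in the following cells.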
After that, we can immediately access our data and visualize it within the notebook, without importing any external modules, thanks to nteract’s Data Explorer.
Parameterizing the notebook
Now that the notebook is ready, let’s parameterize it through Papermill. To do so, we simply need to add a parameters cell containing the parameters we want to pass in at execution time. Assuming we’d like to specify at execution time the date for which to retrieve the stock prices, we can turn the variable yesterday into a parameter through the cell’s menu:
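The tagged cell itself stays trivial; it just assigns a default value, and at run time Papermill injects a new cell right after it containing whatever value we pass in:

```python
# Contents of the cell tagged "parameters": a default value that
# Papermill overrides with the date supplied at execution time.
yesterday = "2017-12-28"
```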
Creating the Airflow DAG
The final step would be adding an Airflow DAG that runs the notebook through Papermill:
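A minimal DAG could look like the sketch below, written against the Airflow 1.x API; the DAG id, notebook paths, and schedule are assumptions to adapt to your own setup:

```python
from datetime import datetime

import papermill as pm
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def run_notebook(ds, **kwargs):
    # Execute the notebook for the DAG's run date (ds), keeping the
    # executed copy as an artifact of the run.
    pm.execute_notebook(
        "/notebooks/stock_prices.ipynb",
        "/notebooks/runs/stock_prices_{}.ipynb".format(ds),
        parameters={"yesterday": ds},
    )


dag = DAG(
    dag_id="stock_prices",
    start_date=datetime(2018, 1, 1),
    schedule_interval="5 0 * * *",  # shortly after Quandl's midnight update
)

execute_notebook = PythonOperator(
    task_id="execute_notebook",
    python_callable=run_notebook,
    provide_context=True,  # passes ds to the callable (Airflow 1.x)
    dag=dag,
)
```

Each run leaves behind a fully executed notebook, so inspecting a failed or suspicious run is as simple as opening its output file.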
As simple as this setup is, it scales well. An Airflow cluster can handle an enormous number of DAGs efficiently, and thanks to the multiple tools in nteract’s ecosystem (notably Papermill), a notebook-based Big Data architecture no longer feels like a wild thought or a difficult-to-achieve concept.
Sure, plenty of problems still need to be addressed, and version control is still more complicated to handle than with classical scripts, but taking into account the advantages this concept offers, it becomes a challenge worth the effort.
- The Jupyter Notebook as Document: From Structure to Application — JupyterCon 2017
- Jupyter: Kernels, Protocols, and the IPython Reference Implementation — JupyterCon 2017
- Notebooks at Netflix: From analytics to engineering — JupyterCon 2018
- From Idea to Product: Customer Profiling in Apache Zeppelin with PySpark — PyCon South Africa 2018