
Explore and analyze your data with Apache Zeppelin – Part 2

24 June 2020 | Big Data

Welcome back to the second part of our series on Apache Zeppelin. In our previous post, ‘EXPLORE & ANALYSE YOUR DATA WITH APACHE ZEPPELIN – Part 1’, we introduced Apache Zeppelin as one of the best Big Data tools for your data analytics use cases and described the back-end interpreters and languages Zeppelin supports. We strongly recommend reading that article before continuing with this one.

In this second part, we first look at Zeppelin dynamic forms and how they let a team of analysts use a Zeppelin notebook as a collaborative, configurable dashboard. Second, we list the main differences between Apache Zeppelin and Jupyter Notebook, one of the best-known and most widely used notebooks for data analytics. Finally, we wrap up with the most commonly recommended Apache Zeppelin best practices.

Apache Zeppelin dynamic forms: A collaborative and configurable Notebook

  1. Quick overview of Zeppelin Dynamic forms:

Apache Zeppelin dynamically creates input forms. Depending on the language backend, there are two ways to create a dynamic form: using form templates or using the programming language.

  • Using Form Templates 

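You can create a text input form, a select form, and a check box form using form templates. You can change the values in the input fields and rerun the paragraph as many times as required.

A simple text input form is created by writing ${formName} or, with a default value, ${formName=defaultValue} in the paragraph text.

A simple check box form is created with the checkbox: prefix, for example ${checkbox:formName=option1|option2,option1|option2|option3}, where the options before the comma are checked by default.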

For more information, see using form templates.
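
To make the template mechanism more concrete, here is a minimal Python sketch of how a paragraph containing ${...} placeholders could be expanded before it runs. It is not Zeppelin's implementation: the render_paragraph function, the regular expression, and the sample query (table and column names) are assumptions made for illustration, and only text input and select forms are handled.

```python
import re

# Matches ${name}, ${name=default} and ${name=default,opt1|opt2}
# (text input and select forms; check box forms are not handled here).
FORM_PATTERN = re.compile(
    r"\$\{(?P<name>\w+)(?:=(?P<default>[^,}]*))?(?:,(?P<options>[^}]*))?\}"
)


def render_paragraph(text, values=None):
    """Replace each form placeholder with the submitted value, or its default."""
    values = values or {}

    def substitute(match):
        name = match.group("name")
        default = match.group("default") or ""
        return str(values.get(name, default))

    return FORM_PATTERN.sub(substitute, text)


# Hypothetical paragraph: the table and column names are made up.
paragraph = (
    "SELECT * FROM bank WHERE age < ${maxAge=30} "
    "AND marital = '${marital=single,single|divorced|married}'"
)

# First run: nothing submitted yet, so the defaults are used.
print(render_paragraph(paragraph))
# Later run: the analyst changed both form values and reran the paragraph.
print(render_paragraph(paragraph, {"maxAge": 45, "marital": "married"}))
```

The second call mirrors what happens when an analyst edits the form fields and reruns the paragraph: the code stays the same, only the substituted values change.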

  • Using Programming Language

You can create text input, select, and check box forms using the Scala (%spark) and Python (%pyspark) interpreters. As with form templates, you can change the input values and rerun the paragraphs as many times as required.

A simple text input form can be created by calling a method on the ZeppelinContext object z, for example z.input("name") in a %pyspark paragraph, which returns the current value of the form.

For more information, see creating forms programmatically.
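
As an illustration of the programmatic style, the sketch below builds a greeting from three forms. Inside a %pyspark paragraph, the z object is provided by Zeppelin; so that the example also runs outside a notebook, it uses a small stand-in class that mimics the z.input, z.select and z.checkbox calls. The stand-in class, the greeting function, and the form names and options are all assumptions made for this example.

```python
class FormContextStandIn:
    """Hypothetical stand-in for Zeppelin's `z` object, usable outside a notebook.

    `submitted` plays the role of the values a user typed into the forms.
    """

    def __init__(self, submitted=None):
        self.submitted = submitted or {}

    def input(self, name, default_value=""):
        return self.submitted.get(name, default_value)

    def select(self, name, options, default_value=""):
        # `options` is a list of (value, display name) pairs.
        return self.submitted.get(name, default_value)

    def checkbox(self, name, options, default_checked=None):
        return self.submitted.get(name, default_checked or [])


def greeting(z):
    """Body of the paragraph: read the form values, then build the output."""
    name = z.input("name", "sun")
    day = z.select("day", [("1", "mon"), ("2", "tue"), ("3", "wed")], "1")
    fruits = z.checkbox("fruit", [("apple", "Apple"), ("banana", "Banana")], ["apple"])
    return "Hello {} (day {}), you picked: {}".format(name, day, " and ".join(fruits))


# First run with default values, then a rerun after the user changed the forms.
print(greeting(FormContextStandIn()))
print(greeting(FormContextStandIn({"name": "moon", "day": "2", "fruit": ["apple", "banana"]})))
```

In a real notebook the stand-in would be dropped and the body of greeting would call the z object that Zeppelin injects into the paragraph.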

Zeppelin creates interactive forms and results visualization in a faster way

  2. Advantages of using Zeppelin Dynamic forms:

Thanks to Zeppelin’s dynamic forms, a notebook can serve as a set of dashboards for project metrics, a common store of scripts and uploads, and a shared workspace for a team of analysts.

Developers embed dynamic forms in their code and display the results. Business analysts then use these forms and can change the input parameters as often as they like; each time, the corresponding results and visualizations are displayed automatically. They can then base their functional analysis on these results.

In a Zeppelin notebook you can hide the code and expose fields for entering dates and other changing parameters, giving the customer a neat, understandable dashboard.

Apache Zeppelin vs Jupyter: What are the differences?

Developers describe Jupyter as “Multi-language interactive computing environments“. The notebook combines live code, equations, narrative text, visualizations, interactive dashboards and other media. Apache Zeppelin, on the other hand, is described as “A web-based notebook that enables interactive data analytics“. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more, as we saw previously.

Jupyter and Apache Zeppelin are both open source tools. Judging by GitHub figures at the time of writing, Jupyter (5.99K stars, 2.54K forks) has somewhat more adoption than Apache Zeppelin (4.23K stars, 2.1K forks).

  • Multi-User Capability

Unlike Zeppelin, Jupyter does not support multiple users by default.

  • Plotting Of Charts

When it comes to plotting charts, Zeppelin wins hands-down because you can use different interpreters in the same notebook and plot various charts. By default, Jupyter has no charting options, but you can of course use existing charting libraries.

  • Reporting

Both notebooks support Markdown, but unlike Jupyter, Zeppelin creates interactive forms and result visualizations quickly. The result is also more accurate and easily accessible to end users, and can be exported in CSV or TSV format.

Unlike Jupyter, Zeppelin offers an option to hide the code, which gives end users readable, interactive reports.

Jupyter can use the Plotly library, which outputs the chart in the notebook, whereas Zeppelin supports only content from Matplotlib (a Python 2D plotting library), which just saves its output in an HTML file.

  • Security

At the time of writing, Jupyter has no privacy configuration for end users. In Zeppelin, on the other hand, you can create flexible security configurations for end users who need privacy for their code.

Zeppelin supports multi-user configuration via LDAP/Active Directory connectivity and specifically defined security groups. It uses only one server process, which authenticates users against the configured system before allowing further access. Zeppelin also makes it possible to share notes only with specific people, with specific permissions.
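
The idea of sharing a note with specific people and specific permissions can be sketched as a set of access lists per note. The model below is a deliberate simplification written for this article, not Zeppelin's code: the NotePermissions class, its methods, and the user and group names are hypothetical, and the rule that an empty list means "anyone" is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class NotePermissions:
    """Simplified per-note access lists; an empty list means 'anyone'."""
    owners: set = field(default_factory=set)
    writers: set = field(default_factory=set)
    runners: set = field(default_factory=set)
    readers: set = field(default_factory=set)

    def _allowed(self, principals, *lists):
        # Access is open if the list for this action (the first one) is empty;
        # otherwise the user or one of their groups must appear in that list
        # or in one of the more privileged lists passed after it.
        if not lists[0]:
            return True
        return any(principals & entries for entries in lists)

    def can_read(self, user, groups=()):
        principals = {user, *groups}
        return self._allowed(principals, self.readers, self.runners, self.writers, self.owners)

    def can_write(self, user, groups=()):
        principals = {user, *groups}
        return self._allowed(principals, self.writers, self.owners)


# Hypothetical note: alice owns it, the "analysts" group may edit, bob may only read.
note = NotePermissions(owners={"alice"}, writers={"analysts"}, readers={"bob"})

print(note.can_read("bob"))                          # bob is listed as a reader
print(note.can_write("bob"))                         # but not as a writer
print(note.can_write("carol", groups=["analysts"]))  # carol inherits from her group
print(note.can_read("dave"))                         # dave is not listed anywhere
```

Group membership would normally come from the configured LDAP or Active Directory system; here it is passed in by hand to keep the sketch self-contained.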

  • Cluster Integration

Zeppelin is part of the Hadoop landscape and integrates well with other Hadoop applications such as Spark, Pig, Hive and others.

  • In-line code execution using paragraphs

Unlike Jupyter, Zeppelin lets you arrange several paragraphs side by side on one line, which is one of its biggest advantages.

  • Auto-completion feature

Jupyter’s code editor and paragraph editor, however, seem to be much more effective, with more hot keys and a great auto-completion feature.

  • Production environment

Because Zeppelin depends on cluster capacity, if production resources are insufficient or the number of users is large (more than 10), Zeppelin may crash, hang, or become unresponsive; notes may fail to load because of size errors; and execution may be very slow compared with Jupyter.

In conclusion, Zeppelin is the better tool if the data analyst or data scientist develops in the Hadoop world. It integrates well with other Hadoop systems such as Spark, Pig and others, and streamlines the development of Spark applications. It also suits larger teams better, but it seems geared towards enterprise users, given its strong LDAP integration and permissions management and its need for sufficient cluster resources.

Jupyter requires less overhead in the setup and productionization of developed patterns because of its standalone nature. Thanks to its large number of extensions and integrations, particularly with Machine Learning and AI frameworks, it has become the more popular choice among analytics users.

Zeppelin Best Practices

  • Install & Versions

Use Ambari to install Zeppelin and always use the latest version, which contains many useful stability and security fixes that will improve your experience.

  • Deployment Choices

While you can install Zeppelin on any node type, the best place is a gateway node: when the cluster is firewalled off and protected from the outside, users can still reach the gateway node.

  • Hardware Requirement 

More memory and more cores are better: use a node with at least 64 GB of memory and at least 8 cores.

Number of users: A given Zeppelin node can support 8-10 users. If you need to support more users, you can set up multiple Zeppelin instances. More details are given in the Multi-Tenancy & HA section.
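
The sizing rule above is simple arithmetic, which can be sketched as a small helper. The function name and the default of 8 users per instance (the conservative end of the 8-10 range) are assumptions made for this example.

```python
import math


def zeppelin_instances_needed(total_users, users_per_instance=8):
    """Return how many Zeppelin instances are needed for `total_users`.

    `users_per_instance` defaults to 8, the lower bound of the 8-10 users
    a single instance is said to support.
    """
    if total_users < 0:
        raise ValueError("total_users must not be negative")
    if users_per_instance <= 0:
        raise ValueError("users_per_instance must be positive")
    # Round up, and always plan for at least one instance.
    return max(1, math.ceil(total_users / users_per_instance))


# Compare the conservative (8) and optimistic (10) ends of the range.
for users in (5, 10, 25, 60):
    print(users, zeppelin_instances_needed(users), zeppelin_instances_needed(users, 10))
```

For 25 analysts, for instance, this gives 4 instances at 8 users each, or 3 instances at 10 users each.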

  • Security

As with any software, security depends on the threat matrix and deployment choices. This section assumes a multi-tenant (MT) Zeppelin deployment.

  • Authentication
      • Kerberize the HDP cluster using Ambari.
      • Configure Zeppelin to leverage corporate LDAP for authentication.
      • Don’t use Zeppelin’s local user based authentication, except for a demo setup.

  • Authorization

Limit end users’ access to interpreter configuration. Interpreter configuration is shared, so only admins should be able to change it. Use Zeppelin’s Shiro configuration to achieve this.

With the Livy interpreter: Spark jobs are sent to the HDP cluster under the end user’s identity. All Ranger-based policy controls apply.

With the JDBC interpreter: Hive and Spark access is done under the end user’s identity. All Ranger-based policy controls apply.

  • Passwords

Use Zeppelin’s support for hiding the LDAP and JDBC passwords in a Hadoop credential store. Don’t put passwords in clear text in shiro.ini.

  • Multi-Tenancy & HA

In a multi-tenant environment, allow only the admin role to access interpreter configuration.

A given Zeppelin instance should support fewer than 10 users. To support more users, set up multiple Zeppelin instances behind an HTTP proxy such as Nginx.

  • Interpreters

Use the Livy interpreter for Spark jobs against the HDP cluster. Don’t use the Spark interpreter, since it does not provide ideal identity propagation.

Avoid using the Shell interpreter, since its security isolation isn’t ideal.

Don’t use the interpreter UI for impersonation. It works only for Livy & JDBC (Hive) interpreters.

Users should restart their own interpreter session from the button on the notebook page rather than from the interpreter page, which would restart sessions for all users.