Welcome to the documentation for NEANIAS Distributed Machine Learning using SparkML


Scaling up machine learning solutions other than deep neural networks (for classification, regression and clustering) can be carried out using the Apache Spark (https://spark.apache.org) framework. Spark is a mature distributed-processing solution that remains current and is widely adopted in the big data context. Its machine learning library, Spark MLlib, provides APIs for Python, Java and R, and it will be integrated with the NEANIAS C3.1 service.
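To illustrate the kind of MLlib workload the service targets, the sketch below trains a logistic regression classifier on a tiny made-up dataset. This is a minimal, hedged example assuming PySpark is installed; the dataset, application name and parameters are illustrative and not part of the C3.4 service itself:

```python
# Illustrative sketch only: a minimal Spark MLlib classification job.
# The dataset and all names here are made up for demonstration purposes.

def toy_rows():
    """Tiny, made-up two-feature binary classification dataset."""
    return [
        (0.0, 0.1, 0),
        (0.2, 0.0, 0),
        (1.8, 2.0, 1),
        (2.1, 1.9, 1),
    ]

if __name__ == "__main__":
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-example").getOrCreate()
    df = spark.createDataFrame(toy_rows(), ["x1", "x2", "label"])

    # Assemble the feature columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    train = assembler.transform(df)

    model = LogisticRegression(maxIter=10).fit(train)
    model.transform(train).select("label", "prediction").show()

    spark.stop()
```

The same pattern extends to the regression and clustering estimators in pyspark.ml, which share the assemble-fit-transform workflow shown here.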

C3.4 is a Spark-based solution for the distributed training and execution of ML tasks.

More detailed documentation is available at https://gitlab.neanias.eu/c3-services/c3-4/c3-4service and will be deployed to a public GitHub repository in the near future.


A Jupyter server for testing some example Spark notebooks is available. In the near future, the access point will be the one offered by the C3.1 service (see: https://docs.neanias.eu/projects/c3-1-ai-gateway/en/latest/), which will grant access to this service.


Since Spark is resource intensive and the service above is publicly accessible with limited resources, it is suitable only for small-scale experiments rather than real computation. To be granted more complete access to the service, please send a request to thomas.cecconello@unimib.it detailing the amount of resources required. If the request is approved, a Kubeconfig file granting access to the resources will be provided. Furthermore, it will be possible to access the Spark environment from C3.1. If you already have access to your own computational resources (even in other cloud environments), we can support you in installing the service on your premises / cloud.

To access the Jupyter test notebook, use the following token: dfe0bae9f9e14ff2c6a7f18548175e4c4491ae917a54bfa3


The service can be used in multiple ways. A Spark application can be launched remotely without the need to monitor the execution, or it can run interactively within a Jupyter notebook. The first solution is called cluster mode, and it can be launched in one of the following ways: a local Spark distribution; the Kubernetes Spark Operator; Apache Airflow.
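In cluster mode, the entry point is a self-contained application script handed to a launcher (spark-submit, the Spark Operator or Airflow) rather than a notebook. A minimal sketch of such a script, assuming PySpark is available, is the classic Monte Carlo pi estimate; the application name and sample count below are illustrative and not specific to the C3.4 deployment:

```python
# Illustrative cluster-mode application sketch. In cluster mode this file
# would be passed to spark-submit, the Spark Operator or an Airflow task;
# nothing here is specific to the C3.4 deployment.
import random

def in_unit_circle(_):
    """One Monte Carlo sample: 1 if a random point falls inside the unit circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

if __name__ == "__main__":
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pi-estimate").getOrCreate()

    # Distribute the samples across the executors and sum the hits.
    n = 100_000
    count = spark.sparkContext.parallelize(range(n)).map(in_unit_circle).sum()
    print(f"pi ~= {4.0 * count / n}")

    spark.stop()
```

Because the job runs unattended in cluster mode, results should be printed to the driver log or written to storage, as in the print above.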

The second solution is called client mode. To test client mode, open the endpoint indicated above. An example of usage can be found in Jupyter in the “hyperopt.ipynb” notebook. The first and last cells of the notebook are crucial because they spawn and kill the Spark workers. Due to the current resource limitations, it is important to close the Spark context at the end of the session. The middle cells show how to find the best hyperparameters for classifying the MNIST dataset.
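The client-mode pattern from the notebook — spawn the workers in the first cell, run the search in the middle, release the workers in the last cell — can be sketched as follows. The toy objective function and parameter names below are illustrative stand-ins, not the actual contents of “hyperopt.ipynb”, which tunes a classifier on MNIST:

```python
# Illustrative client-mode sketch (the toy objective is made up; the actual
# notebook tunes hyperparameters of an MNIST classifier).

def toy_loss(lr):
    """Made-up objective: minimised at lr = 0.1."""
    return (lr - 0.1) ** 2

if __name__ == "__main__":
    # First cell: create the Spark session, which spawns the workers.
    from pyspark.sql import SparkSession
    from hyperopt import fmin, tpe, hp, SparkTrials

    spark = SparkSession.builder.appName("hyperopt-example").getOrCreate()

    # Middle cells: distribute the hyperparameter trials over the workers.
    best = fmin(
        fn=lambda params: toy_loss(params["lr"]),
        space={"lr": hp.uniform("lr", 0.0, 1.0)},
        algo=tpe.suggest,
        max_evals=20,
        trials=SparkTrials(spark_session=spark),
    )
    print("best hyperparameters:", best)

    # Last cell: stop the session to release the limited shared resources.
    spark.stop()
```

As in the notebook, the final spark.stop() is essential: leaving the context open keeps the workers alive and blocks the limited resources for other users.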


Please contact Thomas Cecconello (thomas.cecconello@unimib.it) for any assistance.