Welcome to the documentation for NEANIAS Distributed Machine Learning using SparkML


Machine learning solutions other than deep neural networks (for classification, regression and clustering) can be scaled up using the Apache Spark (https://spark.apache.org) framework. Spark is a mature solution for distributed processing that remains current and widely adopted in big data contexts. Its ML library provides APIs for Python, Java and R, and will be integrated with NEANIAS service C3.1.

C3.4 is a Spark-based solution for distributed training and execution of ML tasks.


The service is available at


The service currently takes the form of a Jupyter Notebook (the token required to access it is ac7f40ca0cfe9f477139f90ed90bcfda1d3bf9b17f447913).


In Jupyter, a usage example can be found in the “segmentation.ipynb” notebook. The first and last sections of the notebook are important because they spawn and kill the Spark workers. Because resources are currently limited, it is important to close the Spark context when you have finished. The middle of the notebook shows how to train a k-means clustering model on one galaxy image and reuse it on a second galaxy image.


We are working on new examples.


Please contact Thomas Cecconello (thomas.cecconello@unimib.it) for any assistance.