Welcome to the documentation for the NEANIAS service for distributed multi-GPU training of large ML models using Horovod

About

Training deep neural networks on big data takes significant time, even on GPU-enabled workstations. To increase efficiency, a distributed computation cluster can be used, where users define the models to train and collect the results. This service is a Horovod-based cluster of GPU-accelerated nodes that communicate over MPI. Training jobs are queued, and the resulting model parameters and training logs are served to the user. Real-time tracking of the training process is also possible, e.g. using TensorBoard.
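
To illustrate what such a training job looks like, below is a minimal sketch of a Horovod training script. TensorFlow/Keras is used here as one possible framework (Horovod also supports PyTorch and MXNet); the model, dataset, and the /horovod/logs directory are illustrative placeholders, not part of the service:

    # Minimal sketch of a Horovod training script (hypothetical example).
    # TensorFlow/Keras is one possible framework; Horovod also supports
    # PyTorch and MXNet. Model, dataset, and paths are placeholders.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # initialize Horovod; workers are coordinated over MPI

    # Pin each worker process to one local GPU, if GPUs are visible.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None] / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged across nodes with allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Start all workers from the same initial weights.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    if hvd.rank() == 0:
        # Only rank 0 writes TensorBoard logs; /horovod is the shared
        # folder described below, "logs" is a hypothetical subfolder.
        callbacks.append(tf.keras.callbacks.TensorBoard(log_dir="/horovod/logs"))

    model.fit(x_train, y_train, batch_size=64, epochs=3,
              callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)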

The current resources:

  • 4 virtual machines
  • 2 vCPUs and 4 GB RAM per node
  • 200 GB shared storage

Endpoint

The service is available at http://90.147.152.68:8888.

Access

To get access to the service, please contact Attila Farkas (attila.farkas@sztaki.hu).

Usage

After a successful login, the Horovod cluster can be used in the JupyterLab environment.

Please use the /horovod folder for training, as this folder is shared across the Horovod nodes.

The addresses of the Horovod cluster nodes for distributed training can be found in the /horovod/horovod_nodes file.
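
For example, a training script saved under /horovod can be launched on all nodes from a JupyterLab terminal or notebook with horovodrun. The following sketch assumes that /horovod/horovod_nodes is in the hostfile format accepted by horovodrun's --hostfile option, and that a script named /horovod/train.py exists; both details go beyond what this guide specifies:

    import subprocess

    # Launch one Horovod worker per VM (4 nodes, per the resources above).
    # Assumes horovodrun is on the PATH and that /horovod/horovod_nodes is
    # a hostfile accepted by horovodrun's --hostfile option;
    # /horovod/train.py is a hypothetical training script in the shared
    # folder.
    subprocess.run(
        ["horovodrun", "-np", "4",
         "--hostfile", "/horovod/horovod_nodes",
         "python", "/horovod/train.py"],
        check=True,
    )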

Documentation on JupyterLab can be found at https://jupyterlab.readthedocs.io/en/stable/index.html.

Documentation on Horovod can be found at https://horovod.readthedocs.io/en/stable/index.html.

Contact

Please contact Attila Farkas (attila.farkas@sztaki.hu) for any assistance.