Welcome to the documentation for NEANIAS Distributed Multi-GPU training of large ML models using Horovod
About
Training deep neural networks on big data takes significant time, even on GPU-enabled workstations. To increase efficiency, a distributed computation cluster should be used, where users can define models to train and collect the results. This service provides a Horovod-based cluster of GPU-accelerated nodes that communicate over MPI. Training jobs are queued; the resulting model parameters and training logs are served back to the user. Real-time tracking of the training process is also possible, e.g. with TensorBoard.
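Conceptually, the core of Horovod's data-parallel training is an MPI-style allreduce that averages gradients across all workers after each backward pass, so every node applies the same update. The following pure-Python sketch illustrates that averaging step; the per-worker gradient values are made-up numbers for illustration, not Horovod API calls or real output:

```python
# Sketch of the gradient-averaging allreduce that Horovod performs each
# training step. The per-worker gradients below are illustrative only.

def allreduce_average(worker_grads):
    """Average one gradient vector element-wise across all workers."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(grad[i] for grad in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Four workers (one per VM in this cluster), each with a local gradient:
grads = [
    [0.2, -0.4],
    [0.4, -0.2],
    [0.0, -0.6],
    [0.2, -0.4],
]
avg = allreduce_average(grads)
print(avg)  # every worker receives and applies the same averaged gradient
```

In the real service this averaging is done by Horovod's optimized ring-allreduce over MPI, not in Python loops; the sketch only shows the mathematical operation being distributed.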
The current resources:
- 4 virtual machines
- 2 vCPUs and 4 GB RAM per node
- 200 GB shared storage
Endpoint
The service is available at http://90.147.152.68:8888.
Access
To get access to the service, please contact Attila Farkas (attila.farkas@sztaki.hu).
Usage
After a successful login, the Horovod cluster can be used in the JupyterLab environment.
Please use the /horovod folder for training, as this folder is shared between the Horovod nodes.
The addresses of the Horovod cluster nodes to use for distributed training are listed in the /horovod/horovod_nodes file.
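The exact layout of /horovod/horovod_nodes on this cluster is not specified here. Assuming it follows the common Horovod hostfile convention of one `hostname slots=N` entry per line, a small helper like the following (the helper name and example contents are illustrative assumptions) could turn it into the `-H` argument expected by `horovodrun`:

```python
# Hypothetical helper: build the -H argument for horovodrun from a
# hostfile. Assumes the common "hostname slots=N" line format; the
# actual layout of /horovod/horovod_nodes on this cluster may differ.

def hosts_argument(hostfile_text):
    """Convert 'hostname slots=N' lines into 'host1:N,host2:N,...'."""
    parts = []
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, slots = line.split()
        parts.append(f"{host}:{slots.split('=')[1]}")
    return ",".join(parts)

# Illustrative contents; the real node names come from /horovod/horovod_nodes.
example = """\
node1 slots=1
node2 slots=1
"""
print(hosts_argument(example))  # → node1:1,node2:1
# which could then be used as, e.g.:
#   horovodrun -np 2 -H node1:1,node2:1 python train.py
```

If the file already contains comma-separated `host:slots` entries, it can of course be passed to `horovodrun -H` directly and no conversion is needed.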
Documentation on JupyterLab can be found at https://jupyterlab.readthedocs.io/en/stable/index.html.
Documentation on Horovod can be found at https://horovod.readthedocs.io/en/stable/index.html.
Contact
Please contact Attila Farkas (attila.farkas@sztaki.hu) for any assistance.