Welcome to the documentation for NEANIAS Distributed Multi-GPU training of large ML models using Horovod
=========================================================================================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

About
-----

Training deep neural networks on large datasets takes significant time, even on GPU-enabled workstations. To increase efficiency, a distributed computation cluster can be used, on which users define models to train and collect the results. This service is a Horovod-based architecture of GPU-accelerated nodes that communicate over the MPI protocol. Training jobs are queued; the resulting model parameters and training logs are served back to the user. Real-time tracking of the training process is possible, e.g. using TensorBoard.

The current resources:

* 4 virtual machines
* 2 vCPUs and 4 GB RAM per node
* 200 GB shared storage

Endpoint
--------

The service is available at http://90.147.152.68:8888 .

Access
------

To get access to the service, please contact Attila Farkas (attila.farkas@sztaki.hu).

Usage
-----

After a successful login, the Horovod cluster can be used from the JupyterLab environment. Please run your training from the ``/horovod`` folder, because this folder is shared between the Horovod nodes. The addresses of the Horovod cluster nodes for distributed training are listed in the ``/horovod/horovod_nodes`` file. A minimal training sketch is given in the Example section below.

Documentation on JupyterLab can be found at https://jupyterlab.readthedocs.io/en/stable/index.html .

Documentation on Horovod can be found at https://horovod.readthedocs.io/en/stable/index.html .

Contact
-------

Please contact Attila Farkas (attila.farkas@sztaki.hu) for any assistance.
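Example
-------

The following is a minimal sketch of a data-parallel training script, following the standard pattern from the Horovod documentation. It assumes TensorFlow and Horovod are installed on the cluster nodes (not confirmed by this page), and the file name ``/horovod/train.py``, the log directory ``/horovod/logs``, and the MNIST model are illustrative choices only; the script is saved under ``/horovod`` so every node can read it from the shared folder.

.. code-block:: python

   # Hypothetical /horovod/train.py -- a minimal Horovod/Keras sketch,
   # not an official example for this service.
   import tensorflow as tf
   import horovod.tensorflow.keras as hvd

   hvd.init()  # Initialize Horovod (one process per node/GPU).

   # Pin each process to a single local GPU, if any are visible.
   gpus = tf.config.experimental.list_physical_devices('GPU')
   if gpus:
       tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

   (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
   x_train = x_train[..., tf.newaxis] / 255.0

   model = tf.keras.Sequential([
       tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dense(10, activation='softmax'),
   ])

   # Scale the learning rate by the number of workers and wrap the optimizer
   # so gradients are averaged across nodes via MPI allreduce.
   opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
   model.compile(optimizer=opt,
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

   callbacks = [
       # Ensure all workers start from the same initial weights.
       hvd.callbacks.BroadcastGlobalVariablesCallback(0),
   ]
   # Write TensorBoard logs from rank 0 only, into the shared folder,
   # so training can be tracked in real time.
   if hvd.rank() == 0:
       callbacks.append(tf.keras.callbacks.TensorBoard(log_dir='/horovod/logs'))

   model.fit(x_train, y_train, batch_size=64, epochs=3,
             callbacks=callbacks,
             verbose=2 if hvd.rank() == 0 else 0)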
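Assuming one worker process per node, and assuming ``/horovod/horovod_nodes`` follows the ``hostname slots=N`` hostfile format that ``horovodrun --hostfile`` expects (please check the file's contents first), the job could be launched from a JupyterLab terminal with ``horovodrun -np 4 --hostfile /horovod/horovod_nodes python /horovod/train.py``. If the file only lists plain addresses, pass them explicitly instead via ``horovodrun -np 4 -H host1:1,host2:1,host3:1,host4:1 python /horovod/train.py``. Horovod then starts one MPI process per slot and averages gradients across them with allreduce; pointing TensorBoard at the log directory used in the sketch (``/horovod/logs`` above) allows the run to be followed in real time.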