

Notes on PyTorch 1.0 and distributed training

PyTorch 1.0 has been released; these are my notes on the blog post announcing it:

https://code.fb.com/ai-research/pytorch-developer-ecosystem-expands-1-0-stable-release/


PyTorch-related projects

  1. Horovod - a distributed training framework that makes it easy for developers to take a single-GPU program and quickly train it on multiple GPUs (see the sketch after this list).
  2. PyTorch Geometry - a geometric computer vision library for PyTorch that provides a set of routines and differentiable modules.
  3. TensorBoardX - a module for logging PyTorch models to TensorBoard, allowing developers to use the visualization tool for model training.
  4. Translate - a library for training sequence-to-sequence models that is based on Facebook's machine translation systems (FairSeq).
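
As a rough sketch of how item 1 looks in practice, here is a minimal Horovod + PyTorch training loop. The model, data, and learning rate are placeholder assumptions; only the hvd.* calls follow Horovod's documented usage:

```python
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin each process to its GPU

model = torch.nn.Linear(10, 1).cuda()        # placeholder model
# Placeholder learning rate, scaled linearly with the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start every worker from the same initial parameters.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):
    x = torch.randn(32, 10).cuda()           # placeholder batch
    y = torch.randn(32, 1).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()                         # allreduce happens here
```

Each process drives one GPU, and launching is handled externally (e.g. with mpirun or horovodrun, depending on the Horovod version). The learning rate scaled by hvd.size() is the linear scaling rule from the "Accurate, Large Minibatch SGD" paper listed below.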

Cloud

PyTorch 1.0 is also available on AWS, Azure, and Google Cloud Platform.

Horovod

Related papers:

  1. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour - Facebook
  2. Horovod: fast and easy distributed deep learning in TensorFlow - Uber
Related blog post:
  1. https://eng.uber.com/horovod/

Figure (from the linked Horovod post): The "data parallel" approach to distributed training involves splitting up the data and training on multiple nodes in parallel. In synchronous cases, the gradients for different batches of data are calculated separately on each node but averaged across nodes to apply consistent updates to the model copy in each node.
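
The synchronous averaging the caption describes can be written down directly with torch.distributed. This is a sketch of the idea, not Horovod's actual ring-allreduce implementation, and it assumes a process group has already been initialized:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers (synchronous data parallelism).

    Assumes dist.init_process_group(...) was already called,
    one process per node/GPU.
    """
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Sum each gradient tensor across all workers, then divide,
            # so every worker applies the same averaged update.
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= world_size
```

Each worker computes loss.backward() on its own shard of the data, calls average_gradients(model), and then runs optimizer.step(); since every worker applies the same averaged gradient, the model copies stay in sync.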



Interesting. The first time I saw training parallelized across GPUs was Hogwild!, which I came across while studying A3C. Hogwild! is asynchronous, whereas Horovod above appears to be used synchronously.

In PyTorch 1.0, the warning that used to sit on the Hogwild API is gone, so it seems safe to use, with some care.
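
For reference, a minimal sketch of the Hogwild pattern in PyTorch, following the shape of the official examples; the model and training data here are placeholder assumptions:

```python
import torch
import torch.multiprocessing as mp

def train(model: torch.nn.Module) -> None:
    # Each process updates the shared parameters without any locking (Hogwild!).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(32, 10)   # placeholder batch
        y = torch.randn(32, 1)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()          # lock-free, asynchronous update

if __name__ == '__main__':
    model = torch.nn.Linear(10, 1)   # placeholder model
    model.share_memory()             # put parameters in shared memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

model.share_memory() is what makes this Hogwild-style: all processes read and write the same parameter tensors concurrently, which is exactly the asynchronous behavior contrasted with Horovod's synchronous averaging above.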