PyTorch Multi-GPU Training - Single Compute Node - All You Need

2024-02-15 06:47 • Miscellaneous • 2242 views

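Overview

Multi-GPU training in PyTorch is essentially data parallelism: every GPU holds a full copy of the model parameters, each batch is split evenly into N shards, and each GPU processes one shard. The gradients from all GPUs are then combined into the gradient for the whole batch, and that combined gradient is used to update the parameters on every GPU, which completes one iteration.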
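As a rough illustration of why combining per-shard gradients works, the sketch below checks on the CPU that averaging the gradients of equally sized shards reproduces the full-batch gradient when the loss is a mean over samples. The toy `nn.Linear` model, the random data, the MSE loss, and the shard count are all assumptions made for illustration; no multi-GPU API is involved.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(5, 2)
loss_fn = nn.MSELoss()          # mean reduction over all elements
x = torch.randn(30, 5)          # one batch of 30 samples
y = torch.randn(30, 2)          # hypothetical regression targets

# Gradient computed on the whole batch at once.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Gradient computed shard by shard, as N "virtual GPUs" would do.
num_shards = 3                  # assumption: 30 divides evenly into 3 shards
shard_grads = []
for xs, ys in zip(x.chunk(num_shards), y.chunk(num_shards)):
    model.zero_grad()
    loss_fn(model(xs), ys).backward()
    shard_grads.append(model.weight.grad.clone())

# Averaging the shard gradients recovers the full-batch gradient.
avg_grad = torch.stack(shard_grads).mean(dim=0)
matches = torch.allclose(full_grad, avg_grad, atol=1e-6)
print("averaged shard gradient matches full-batch gradient:", matches)
```

The equality relies on the shards having the same size; with uneven shards a plain average would weight samples differently.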

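There are two ways to do multi-GPU training. The first uses nn.DataParallel, the earliest mechanism introduced in PyTorch; it is simple to use and involves no multiprocessing. The second combines torch.nn.parallel.DistributedDataParallel and torch.utils.data.distributed.DistributedSampler with multiple processes. It is somewhat harder to implement, but it also supports multi-node distributed training and is more efficient than the first approach, even on a single compute node. See the PyTorch docs: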

In the single-machine synchronous case, torch.distributed or the torch.nn.parallel.DistributedDataParallel() wrapper may still have advantages over other approaches to data parallelism, including torch.nn.DataParallel():

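This article covers both approaches in detail, restricted to a single machine. Distributed (multi-node) training is more involved and is left for the next article.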
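To make the data split in the second approach a little more concrete, here is a minimal CPU-only sketch of how DistributedSampler assigns dataset indices to each process. Passing `num_replicas` and `rank` explicitly avoids initializing a process group; the ten-element dataset and the world size of 2 are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))   # toy dataset with 10 samples
world_size = 2                              # assumption: 2 processes, one per GPU

shards = []
for rank in range(world_size):
    # Explicit num_replicas/rank means no torch.distributed init is needed here.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    shards.append(list(sampler))
    print(f"rank {rank} sees indices {shards[rank]}")
```

Each rank receives a disjoint subset of the indices, so across all processes every sample is visited once per epoch.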


Approach 1

Steps

Wrap the model with nn.DataParallel.

model = nn.DataParallel(model)

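Use os.environ["CUDA_VISIBLE_DEVICES"] to specify which GPU device IDs the program may use, for example "0" for the first GPU only. If it is not set, all GPUs on the machine are used. Set it before CUDA is initialized, otherwise it has no effect.

os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2"  # use 3 GPUs

Use model.cuda() or model.to("cuda"), and data.cuda() or data.to("cuda"), to move the model and the data onto the GPU.

The training loop is the same as with a single GPU. With this approach PyTorch automatically splits each batch into N parts (N being the number of GPUs made visible via os.environ), runs forward and backward on each part, then automatically combines the gradients from all GPUs, updates the parameters on one GPU, and finally broadcasts the parameters to the other GPUs, completing one iteration.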
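To see what the split looks like for a given batch size, the sketch below uses torch.chunk on the CPU as a stand-in for the scatter step. That this mirrors how nn.DataParallel divides a batch along dim 0 is an assumption of the sketch, and the helper name `split_sizes` is hypothetical.

```python
import torch


def split_sizes(batch_size, num_gpus, feature_dim=5):
    """Return the per-device batch sizes when a batch is chunked along dim 0."""
    batch = torch.randn(batch_size, feature_dim)
    return [chunk.size(0) for chunk in torch.chunk(batch, num_gpus, dim=0)]


print(split_sizes(30, 2))   # a full batch of 30 on 2 GPUs
print(split_sizes(30, 3))   # a full batch of 30 on 3 GPUs
print(split_sizes(10, 3))   # a smaller final batch on 3 GPUs
```

When the batch size is not divisible by the number of GPUs, the last device simply receives a smaller chunk, so per-GPU sizes printed inside the model can differ within one iteration.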

Test

Code:


```python
import os
import time

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):
    """Dataset of `length` random vectors, each with `size` features."""

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


# model define
class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output


if __name__ == "__main__":
    # Parameters
    input_size = 5
    output_size = 2

    batch_size = 30
    data_size = 100

    dataset = RandomDataset(input_size, data_size)
    # dataloader define
    rand_loader = DataLoader(dataset=dataset,
                             batch_size=batch_size, shuffle=True)

    # model init
    model = Model(input_size, output_size)

    # cuda devices (set before the first CUDA call so it takes effect)
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        # dim = 0 [30, xxx] -> [15, ...], [15, ...] on 2 GPUs
        model = nn.DataParallel(model)

    model.to(device)

    for data in rand_loader:
        input = data.to(device)
        output = model(input)
        # loss

        # backward

        # update

        time.sleep(1)  # simulate a relatively long batch time
        print("Outside: input size", input.size(),
              "output_size", output.size())

    # DataParallel keeps the original model in .module; unwrap only if wrapped
    to_save = model.module if isinstance(model, nn.DataParallel) else model
    torch.save(to_save.state_dict(), "model.pth")
```
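The loop above leaves the loss, backward, and update steps as placeholders. A minimal sketch of how they might be filled in is shown here; the MSE loss, the SGD optimizer, the learning rate, and the random regression targets are assumptions for illustration and are not part of the original test. The code falls back to a single device when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(5, 2)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # only wrap when several GPUs are visible
model.to(device)

criterion = nn.MSELoss()                                   # assumed loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # assumed optimizer

inputs = torch.randn(30, 5).to(device)
targets = torch.randn(30, 2).to(device)  # hypothetical regression targets

losses = []
for step in range(5):
    optimizer.zero_grad()
    outputs = model(inputs)              # batch is split across GPUs when wrapped
    loss = criterion(outputs, targets)   # loss is computed on the gathered outputs
    loss.backward()                      # gradients end up on the wrapped model
    optimizer.step()                     # single parameter update
    losses.append(loss.item())

print("loss per step:", losses)
```

Because the optimizer is built from `model.parameters()`, the same training code works whether or not the model is wrapped.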

- With a single GPU, the test output is as follows; the input and output sizes inside and outside the model are identical.

    In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])