Gradient accumulation is a memory optimization technique that simulates large-batch training by accumulating gradients over multiple small micro-batches before each optimizer update.
Effective Batch Size = Micro Batch Size × Accumulation Steps × Num GPUs
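For example, with hypothetical values of a micro-batch of 4 per GPU, 8 accumulation steps, and 4 GPUs:

micro_batch_size = 4
accumulation_steps = 8
num_gpus = 4
effective_batch_size = micro_batch_size * accumulation_steps * num_gpus   # 4 * 8 * 4 = 128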
# Correct implementation
for step in range(accumulation_steps):
    loss = model(batch[step])
    loss = loss / accumulation_steps  # key: scale the loss by the number of accumulation steps
    loss.backward()                   # gradients accumulate across micro-batches
optimizer.step()                      # one optimizer update per effective batch
optimizer.zero_grad()
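A slightly fuller, self-contained sketch of the same pattern over a DataLoader (the toy model, dataset sizes, and accumulation_steps value below are illustrative assumptions):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup; only the accumulation pattern matters here
model = nn.Linear(16, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
dataloader = DataLoader(dataset, batch_size=4)   # micro-batch size = 4

accumulation_steps = 8   # effective batch on one GPU = 4 * 8 = 32

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets)
    loss = loss / accumulation_steps   # scale so gradients average over the window
    loss.backward()                    # accumulate into param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()               # update once per effective batch
        optimizer.zero_grad()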
# Linear scaling rule: scale the learning rate with the effective batch size
# (base_lr is assumed to be tuned for a reference batch size base_batch_size)
effective_lr = base_lr * (batch_size * accumulation_steps * num_gpus) / base_batch_size
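A quick worked check of the rule with hypothetical numbers (a base rate tuned for batch size 32, scaled up to an effective batch of 128):

base_lr = 1e-4                                        # hypothetical rate tuned for base_batch_size
base_batch_size = 32
batch_size, accumulation_steps, num_gpus = 4, 8, 4    # effective batch = 4 * 8 * 4 = 128
effective_lr = base_lr * (batch_size * accumulation_steps * num_gpus) / base_batch_size
print(effective_lr)                                   # 4e-04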
# DDP example: update only at the accumulation boundary
if (step + 1) % accumulation_steps == 0:
    # DDP's gradient all-reduce actually runs inside backward();
    # this step applies the already-synchronized gradients
    optimizer.step()
    optimizer.zero_grad()
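By default DDP all-reduces gradients on every backward(), which wastes communication on micro-steps that do not update the weights. Below is a minimal single-process sketch of the usual no_sync() pattern; the gloo/CPU setup and all sizes are assumptions made only so the example runs standalone:

import contextlib
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

# Minimal single-process process group so the sketch runs on CPU
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataloader = DataLoader(TensorDataset(torch.randn(256, 16),
                                      torch.randint(0, 2, (256,))), batch_size=4)
accumulation_steps = 8

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    is_update_step = (step + 1) % accumulation_steps == 0
    # defer gradient all-reduce to the last micro-step of each window
    ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
    with ctx:
        loss = criterion(model(inputs), targets) / accumulation_steps
        loss.backward()
    if is_update_step:
        optimizer.step()       # gradients were all-reduced during the final backward()
        optimizer.zero_grad()

dist.destroy_process_group()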
# Combining with other techniques
gradient_accumulation_steps = 8
gradient_checkpointing = True   # saves additional memory
mixed_precision = "fp16"        # reduces memory footprint
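These settings correspond to standard trainer options; a minimal sketch using Hugging Face transformers.TrainingArguments (the output directory, batch size, and other values are placeholder assumptions):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",          # placeholder path
    per_device_train_batch_size=4,       # micro-batch size per GPU
    gradient_accumulation_steps=8,       # effective batch = 4 * 8 * num_gpus
    gradient_checkpointing=True,         # recompute activations to save memory
    fp16=True,                           # fp16 mixed precision
)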