Reducing Activation Recomputation in Large Transformer Models
Achieves a Model FLOPS Utilization (MFU) of 54.2% (baseline: 42.1%) when training a 530B-parameter GPT-3-style model on 2240 NVIDIA A100 GPUs. https://t.co/zt6LPIthTj

— Aran Komatsuzaki (@arankomatsuzaki) May 12, 2022
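Since the headline number here is Model FLOPS Utilization, a minimal sketch of how that metric is computed may help. This is an illustrative estimate, not the paper's own accounting: it assumes the standard ~6 FLOPs per parameter per token approximation for a combined forward and backward pass, and the A100's 312 TFLOP/s dense BF16 peak. The `tokens_per_second` value is hypothetical, back-solved to land near the quoted 54.2%, and does not come from the tweet or the paper.

```python
# Sketch: estimating Model FLOPS Utilization (MFU) for a GPT-style model.
# Assumption: ~6 FLOPs per parameter per token for forward + backward.

A100_PEAK_FLOPS = 312e12  # dense BF16 peak of one NVIDIA A100, FLOPs/s


def model_flops_utilization(n_params: float,
                            tokens_per_second: float,
                            n_gpus: int,
                            peak_flops_per_gpu: float = A100_PEAK_FLOPS) -> float:
    """Fraction of peak hardware FLOPs spent on the model's own math.

    Recomputed activations cost real hardware time but do not count
    toward the numerator, which is why reducing recomputation raises MFU.
    """
    model_flops_per_second = 6 * n_params * tokens_per_second
    return model_flops_per_second / (n_gpus * peak_flops_per_gpu)


# Illustrative throughput only, chosen to land near the tweet's figure.
mfu = model_flops_utilization(n_params=530e9,
                              tokens_per_second=119_200,
                              n_gpus=2240)
print(f"MFU = {mfu:.1%}")  # -> MFU = 54.2%
```

The denominator is fixed by the hardware, so the only way to raise MFU at a given cluster size is to push more useful model FLOPs through per second; cutting activation recomputation does exactly that by freeing GPU time that would otherwise be spent redoing forward-pass work.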