pipegoose 0.2.0
🚧 PipeGoose: Training any 🤗 transformers in Megatron-LM 3D parallelism and ZeRO-1 out of the box
Honk honk honk! This project is actively under development. Check out my learning progress here.
⚠️ The project is actively under development and not ready for use.
⚠️ The APIs are still a work in progress and could change at any time. None of the public APIs are set in stone until we hit version 0.6.9.
⚠️ Support for hybrid 3D parallelism and distributed optimizer for 🤗 transformers will be available in the upcoming weeks (it's basically done, but it doesn't support 🤗 transformers yet)
⚠️ **This library is underperforming compared to Megatron-LM and DeepSpeed (it does not yet achieve reasonable performance). We're actively pushing it to reach 180 TFLOPs and go beyond Megatron-LM.**
from torch.utils.data import DataLoader
+ from torch.utils.data.distributed import DistributedSampler
from torch.optim import SGD
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
+ from pipegoose.distributed import ParallelContext, ParallelMode
+ from pipegoose.nn import DataParallel, TensorParallel
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer.pad_token = tokenizer.eos_token
BATCH_SIZE = 4
+ DATA_PARALLEL_SIZE = 2
+ parallel_context = ParallelContext.from_torch(
+     tensor_parallel_size=2,
+     data_parallel_size=2,
+     pipeline_parallel_size=1
+ )
+ model = TensorParallel(model, parallel_context).parallelize()
+ model = DataParallel(model, parallel_context).parallelize()
model.to("cuda")
+ device = next(model.parameters()).device
optim = SGD(model.parameters(), lr=1e-3)
dataset = load_dataset("imdb", split="train")
+ dp_rank = parallel_context.get_local_rank(ParallelMode.DATA)
+ sampler = DistributedSampler(dataset, num_replicas=DATA_PARALLEL_SIZE, rank=dp_rank, seed=42)
+ dataloader = DataLoader(dataset, batch_size=BATCH_SIZE // DATA_PARALLEL_SIZE, shuffle=False, sampler=sampler)
for epoch in range(100):
+     sampler.set_epoch(epoch)
    for batch in dataloader:
        inputs = tokenizer(batch["text"], padding=True, truncation=True, max_length=1024, return_tensors="pt")
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        labels = inputs["input_ids"]
        outputs = model(**inputs, labels=labels)
        optim.zero_grad()
        outputs.loss.backward()
        optim.step()
Installation and try it out
You can install the package through the following command:
pip install pipegoose
And try out a hybrid tensor and data parallelism training script.
cd pipegoose/examples
torchrun --standalone --nnodes=1 --nproc-per-node=4 hybrid_parallelism.py
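To make the 4-process launch above more concrete, here is a minimal sketch of how four global ranks could be grouped when `tensor_parallel_size=2` and `data_parallel_size=2`. It assumes that tensor-parallel ranks are adjacent (the convention Megatron-LM uses) and that the pipeline size is 1; pipegoose's actual rank mapping may differ, and `build_groups` is a hypothetical helper, not part of the library.

```python
def build_groups(world_size, tensor_parallel_size, data_parallel_size):
    """Return (tensor_groups, data_groups) as lists of global-rank lists.

    Assumption: tensor-parallel ranks are adjacent and pipeline size is 1.
    """
    if tensor_parallel_size * data_parallel_size != world_size:
        raise ValueError("tensor_parallel_size * data_parallel_size must equal world_size")

    # Each tensor-parallel group is a run of consecutive ranks.
    tensor_groups = [
        list(range(start, start + tensor_parallel_size))
        for start in range(0, world_size, tensor_parallel_size)
    ]
    # Each data-parallel group collects the ranks holding the same model shard.
    data_groups = [
        list(range(offset, world_size, tensor_parallel_size))
        for offset in range(tensor_parallel_size)
    ]
    return tensor_groups, data_groups


tensor_groups, data_groups = build_groups(
    world_size=4, tensor_parallel_size=2, data_parallel_size=2
)
print(tensor_groups)  # [[0, 1], [2, 3]]
print(data_groups)    # [[0, 2], [1, 3]]
```

Under this assumed layout, ranks 0 and 1 split one model replica between them, while ranks 0 and 2 hold the same shard and see different data.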
We ran a small-scale correctness test by comparing the training losses of a parallelized transformer and an unparallelized baseline, both starting from an identical checkpoint and using the same training data. We will conduct rigorous large-scale convergence and weak scaling law benchmarks against Megatron-LM and DeepSpeed in the near future.
Data Parallelism
Tensor Parallelism
Hybrid 2D Parallelism
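A loss comparison like the one described above could be automated roughly as follows. This is only a sketch: the helper names, the relative tolerance, and the loss values are all assumptions made for illustration (the numbers are placeholders, not measured results).

```python
def max_relative_gap(reference_losses, parallel_losses):
    """Largest per-step relative difference between two loss curves."""
    if len(reference_losses) != len(parallel_losses):
        raise ValueError("loss curves must have the same number of steps")
    if not reference_losses:
        raise ValueError("loss curves must not be empty")
    return max(
        abs(ref - par) / max(abs(ref), 1e-12)
        for ref, par in zip(reference_losses, parallel_losses)
    )


def losses_match(reference_losses, parallel_losses, rel_tol=1e-3):
    """True if the parallelized run tracks the baseline within rel_tol at every step."""
    return max_relative_gap(reference_losses, parallel_losses) <= rel_tol


# Placeholder numbers for illustration only, not measured results.
reference = [4.00, 3.50, 3.10]
parallel = [4.00, 3.501, 3.099]
print(losses_match(reference, parallel))
```

A per-step check is stricter than comparing final losses only, which helps catch a divergence that later happens to recover.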
Features
Megatron-style 3D parallelism
ZeRO-1: Distributed BF16 Optimizer
Highly optimized CUDA kernels ported from Megatron-LM and DeepSpeed
...
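As a rough illustration of the ZeRO-1 idea listed above, where each data-parallel rank keeps optimizer state for only a slice of the parameters, the sketch below assigns parameters to ranks with a greedy size-balancing heuristic. The heuristic, the function name `shard_parameters`, and the parameter names and sizes are assumptions for illustration; this is not pipegoose's implementation.

```python
def shard_parameters(param_sizes, data_parallel_size):
    """Assign each parameter to the data-parallel rank with the fewest elements so far.

    Returns (shards, loads): the parameter names owned by each rank and the
    total number of elements whose optimizer state that rank would hold.
    """
    shards = [[] for _ in range(data_parallel_size)]
    loads = [0] * data_parallel_size
    # Place the largest parameters first so the greedy balance is tighter.
    for name, numel in sorted(param_sizes.items(), key=lambda item: item[1], reverse=True):
        rank = loads.index(min(loads))
        shards[rank].append(name)
        loads[rank] += numel
    return shards, loads


# Hypothetical parameter names and element counts.
param_sizes = {
    "embed.weight": 1000,
    "layer0.weight": 600,
    "layer1.weight": 600,
    "head.bias": 10,
}
shards, loads = shard_parameters(param_sizes, data_parallel_size=2)
print(shards)  # [['embed.weight', 'head.bias'], ['layer0.weight', 'layer1.weight']]
print(loads)   # [1010, 1200]
```

In a real ZeRO-1 step, each rank would update only the parameters it owns and then share the updated values with the other data-parallel ranks.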
Implementation Details
Supports training transformer models with Megatron-style 3D parallelism and ZeRO-1 (written from scratch).
Implements parallel compute and data transfer using separate CUDA streams.
Gradient checkpointing will be implemented by enforcing a virtual dependency in the backpropagation graph, ensuring that the activations for each checkpoint are recomputed just in time for each (micro-batch, partition).
Custom algorithms for model partitioning with two default partitioning models based on elapsed time and GPU memory consumption per layer.
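The partitioning item above can be illustrated with a small sketch. It takes one cost per layer (elapsed time or memory consumption, as mentioned above) and splits the layers into contiguous partitions so that the most expensive partition is as cheap as possible. The min-max objective and the dynamic-programming approach are assumptions chosen for illustration; the source does not say which algorithm pipegoose uses, and the costs below are placeholders.

```python
from itertools import accumulate


def partition_layers(costs, num_partitions):
    """Split layers into contiguous partitions minimizing the largest partition cost."""
    n = len(costs)
    if not 1 <= num_partitions <= n:
        raise ValueError("need 1 <= num_partitions <= number of layers")

    prefix = [0.0] + list(accumulate(costs))
    inf = float("inf")
    # best[k][i]: smallest bottleneck when the first i layers form k partitions.
    best = [[inf] * (n + 1) for _ in range(num_partitions + 1)]
    cut = [[0] * (n + 1) for _ in range(num_partitions + 1)]
    best[0][0] = 0.0

    for k in range(1, num_partitions + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i] = bottleneck
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the boundaries.
    bounds = []
    i = n
    for k in range(num_partitions, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return [list(range(start, stop)) for start, stop in bounds]


# Placeholder per-layer costs (e.g. milliseconds per forward pass).
layer_costs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
partitions = partition_layers(layer_costs, num_partitions=3)
print(partitions)  # [[0, 1, 2], [3, 4], [5]]
```

Balancing the most expensive partition matters because, in a pipeline, the slowest stage sets the pace for every micro-batch.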
Potential support includes:
Callbacks within the pipeline: Callback(function, microbatch_idx, partition_idx) for before and after the forward, backward, and recompute steps (for gradient checkpointing).
Mixed precision training.
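Since the callback item above is listed only as potential support, the following is a speculative sketch of what a `Callback(function, microbatch_idx, partition_idx)` hook might look like. The `CallbackRegistry` class, the `fire` method, and the event names are hypothetical and are not part of pipegoose's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Callback:
    """A function to run for one specific (micro-batch, partition) pair."""
    function: Callable[[str, int, int], None]
    microbatch_idx: int
    partition_idx: int


class CallbackRegistry:
    """Hypothetical dispatcher a pipeline engine could call around each step."""

    def __init__(self) -> None:
        self._callbacks: List[Callback] = []

    def register(self, callback: Callback) -> None:
        self._callbacks.append(callback)

    def fire(self, event: str, microbatch_idx: int, partition_idx: int) -> None:
        # Run only the callbacks registered for this (micro-batch, partition).
        for callback in self._callbacks:
            if (callback.microbatch_idx, callback.partition_idx) == (microbatch_idx, partition_idx):
                callback.function(event, microbatch_idx, partition_idx)


events = []
registry = CallbackRegistry()
registry.register(
    Callback(
        function=lambda event, mb, part: events.append((event, mb, part)),
        microbatch_idx=0,
        partition_idx=1,
    )
)

for event in ("before_forward", "after_forward"):
    registry.fire(event, microbatch_idx=0, partition_idx=1)
registry.fire("before_forward", microbatch_idx=1, partition_idx=1)  # no callback registered

print(events)  # [('before_forward', 0, 1), ('after_forward', 0, 1)]
```

The same dispatch could cover the backward and recompute steps by firing further event names at those points.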
Appreciation
Big thanks to 🤗 Hugging Face for sponsoring this project with 8x A100 GPUs for testing, and to Zach Schrier for monthly Twitch donations!
The library's APIs are inspired by OSLO's and ColossalAI's APIs.