Maximize performance
This guide covers performance considerations, CPU/GPU synchronization patterns, and best practices for efficient W&B logging.
Performance overview
Under normal usage (logging less than once per second), W&B adds minimal overhead to your training. The SDK is designed to handle various logging patterns efficiently while avoiding common performance pitfalls.
CPU and GPU synchronization
The challenge
When training on a GPU, you often want to log metrics without forcing CPU-GPU synchronization. Each time the CPU has to wait for pending GPU computations, the training loop stalls, which can significantly slow down training.
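As a minimal PyTorch sketch (assuming a CUDA device is available), any call that copies a tensor's value back to the host, such as .item(), .cpu(), or printing the tensor, blocks until the GPU has finished computing that value:
import torch

device = torch.device("cuda")
x = torch.randn(4096, 4096, device=device)

y = x @ x               # Queued on the GPU; control returns to Python immediately
value = y.sum().item()  # Copies the result to the CPU, waiting for the matmul to finish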
How W&B handles GPU data
W&B’s architecture avoids most synchronization issues:
# Good: This doesn't block on GPU computation
loss = model(x) # GPU operation
wandb.log({"loss": loss}) # Returns immediately
# The actual value is only read when needed
Recommended patterns for GPU logging
Pattern 1: Batch logging
Instead of logging every step, accumulate metrics and log periodically:
losses = []
for batch_idx, (x, y) in enumerate(dataloader):
    x, y = x.to(device), y.to(device)
    y_pred = model(x)
    loss = criterion(y_pred, y)

    # Accumulate a detached copy without forcing synchronization
    losses.append(loss.detach())

    # Log every N batches
    if batch_idx % log_interval == 0:
        # This forces synchronization only once every N batches
        wandb.log({
            "loss": torch.stack(losses).mean().item(),
            "batch": batch_idx,
        })
        losses = []

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Pattern 2: Deferred logging
For truly asynchronous logging, defer metric computation:
# Store references without forcing computation
gpu_metrics = []

for batch_idx, (x, y) in enumerate(dataloader):
    x, y = x.to(device), y.to(device)
    y_pred = model(x)
    loss = criterion(y_pred, y)

    # Store the tensor reference (no sync)
    gpu_metrics.append({
        "batch": batch_idx,
        "loss": loss.detach(),  # Detach from the computation graph
    })

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After the epoch, log all metrics at once
for metric in gpu_metrics:
    wandb.log({
        "loss": metric["loss"].item(),  # CPU sync happens here
        "batch": metric["batch"],
    })
Performance guidelines
What you should know
- Logging Frequency: W&B handles logging rates up to a few times per second efficiently. For higher frequencies, batch your metrics.
- Data Size: Each log call should contain at most a few megabytes of data. Sample or summarize large tensors, as in the sketch after this list.
- Network Failures: W&B handles network issues with exponential back-off and retry logic. Network issues won't interrupt your training.
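For example, a minimal sketch (assuming activations is a hypothetical large PyTorch tensor, such as a batch of hidden states) that logs cheap summary statistics instead of the tensor itself:
# activations: a hypothetical large tensor, e.g. a batch of hidden states
wandb.log({
    "activations/mean": activations.mean().item(),
    "activations/std": activations.std().item(),
    "activations/abs_max": activations.abs().max().item(),
})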
What not to worry about
- Network Latency: Uploads happen in a background process and won't block training (see the note after this list on flushing at the end of a run)
- Disk I/O: W&B uses efficient buffering to minimize disk operations
- Server Availability: Local logging continues even if servers are unavailable
- Memory Usage: W&B automatically manages buffer sizes to prevent memory issues
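Because uploads run in the background, any data still in flight is flushed when the run ends. A minimal sketch, assuming the standard wandb Run API and a hypothetical project name, showing where that remaining work is paid:
import wandb

run = wandb.init(project="my-project")  # hypothetical project name

# ... training loop with run.log(...) calls; these do not block on uploads ...

# finish() flushes the background process and waits for pending uploads,
# so remaining network work happens here rather than inside the training loop.
run.finish()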
Common performance pitfalls
1. Excessive logging frequency
# Bad: Logging too frequently
for i in range(1000):
    wandb.log({"metric": i})  # Don't do this in a tight loop

# Good: Batch your logging
metrics = []
for i in range(1000):
    metrics.append(i)
wandb.log({"metrics": metrics})
2. Large data volumes
# Bad: Logging large tensors directly
wandb.log({"huge_tensor": model.state_dict()}) # Don't log entire models
# Good: Log summaries or samples
wandb.log({"weight_norm": compute_weight_norm(model)})
3. Forced synchronization
# Bad: Forces GPU-CPU sync every iteration
for batch in dataloader:
    loss = model(batch)
    wandb.log({"loss": loss.item()})  # .item() forces sync

# Good: Log less frequently
for step, batch in enumerate(dataloader):
    loss = model(batch)
    if step % 100 == 0:
        wandb.log({"loss": loss.item()})
Best practices
1. Log strategically
# Log important metrics every N steps
if step % args.log_interval == 0:
    wandb.log({
        "train/loss": loss.item(),
        "train/accuracy": accuracy,
        "train/learning_rate": scheduler.get_last_lr()[0],
        "train/epoch": epoch,
    })
2. Use histograms for distributions
Instead of logging individual values, use histograms:
# Instead of logging each gradient value, log the distribution
grads = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
wandb.log({"gradients": wandb.Histogram(grads.cpu().numpy())})
3. Profile your logging
If you’re concerned about performance, profile your logging:
import time

# Measure logging overhead
start = time.time()
wandb.log({"metric": value})
log_time = time.time() - start

if log_time > 0.001:  # If logging takes more than 1 ms
    print(f"Warning: Logging took {log_time * 1000:.1f} ms")
4. Use tables for structured data
For complex data, use W&B Tables which are optimized for performance:
# Log predictions efficiently
if epoch % val_interval == 0:
    table = wandb.Table(columns=["image", "prediction", "truth"])
    for img, pred, truth in val_samples:
        table.add_data(wandb.Image(img), pred, truth)
    wandb.log({"predictions": table})
5. Configure buffer sizes
For high-frequency logging, you can adjust buffer sizes:
# Increase buffer size for high-frequency logging
wandb.init(
    project="high-freq-experiment",
    settings=wandb.Settings(
        _stats_sample_rate_seconds=0.1,  # Sample system stats more frequently
        _internal_queue_size=10000,      # Larger internal queue
    ),
)
Performance benchmarks
Typical overhead measurements:
- wandb.log() call: < 1ms
- Memory overhead: ~50-200 MB depending on buffer configuration
- CPU usage: < 1% for typical logging patterns
- Network bandwidth: Automatically throttled to avoid interference
Optimizing for different scenarios
High-frequency training (RL, online learning)
# Buffer multiple steps before logging
step_buffer = []

for step in range(total_steps):
    action = agent.act(state)
    next_state, reward, done = env.step(action)
    state = next_state

    step_buffer.append({
        "reward": reward,
        "step": step,
    })

    # Log every 100 steps
    if step % 100 == 0:
        avg_reward = sum(s["reward"] for s in step_buffer) / len(step_buffer)
        wandb.log({
            "avg_reward": avg_reward,
            "step": step,
        })
        step_buffer = []
Large-scale distributed training
# Only log from rank 0 to avoid duplicate logging
if torch.distributed.get_rank() == 0:
    wandb.log({
        "loss": loss.item(),
        "global_step": global_step,
    })
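A common companion pattern (a sketch, with hypothetical project and helper names) is to initialize the run only on rank 0 as well, so the other ranks never start a W&B process:
import torch.distributed as dist
import wandb

# Initialize a run only on rank 0; other ranks skip W&B entirely
run = wandb.init(project="distributed-training") if dist.get_rank() == 0 else None

for global_step, batch in enumerate(dataloader):
    loss = train_step(batch)  # hypothetical helper returning a scalar loss
    if run is not None and global_step % 100 == 0:
        run.log({"loss": float(loss), "global_step": global_step})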
Memory-constrained environments
# Use minimal logging configuration
wandb.init(
    project="memory-constrained",
    settings=wandb.Settings(
        _disable_stats=True,       # Disable system metrics
        _internal_queue_size=100,  # Smaller queue
    ),
)

# Log only essential metrics
if step % 1000 == 0:  # Very infrequent logging
    wandb.log({"loss": loss.item()})
Summary
For optimal W&B performance:
- Log at reasonable frequencies (a few times per second at most)
- Batch high-frequency metrics to reduce overhead
- Avoid forcing GPU-CPU synchronization when not needed
- Use W&B’s specialized data types (Histogram, Table, Image) for complex data
- Profile your logging if you have performance concerns
The SDK is designed to handle the complexity of distributed logging so you can focus on your experiments without worrying about logging infrastructure.
Related resources
- SDK Architecture - Understanding W&B’s event-driven architecture
- Logging Reference - Complete logging API documentation
- Configuration Options - Advanced configuration settings