PyTorch Lightning Integration¶
Tracelet automatically captures PyTorch Lightning training metrics without requiring changes to your LightningModule code.
Overview¶
The Lightning integration hooks into Lightning's logging system to capture every metric logged via self.log() in your LightningModule.
Supported Features¶
- Training Metrics - Loss, accuracy, custom metrics
- Validation Metrics - Validation loss, metrics from validation_step
- Test Metrics - Test phase metrics
- Hyperparameters - Model and trainer configuration
- System Metrics - GPU utilization, memory usage during training
Basic Usage¶
import tracelet
import pytorch_lightning as pl
from pytorch_lightning import Trainer
# Start Tracelet before creating trainer
tracelet.start_logging(
    exp_name="lightning_experiment",
    project="my_project",
    backend="mlflow"
)
# Define your LightningModule as usual
class MyModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)
        # All these metrics are automatically captured by Tracelet
        self.log('train/loss', loss)
        self.log('train/accuracy', self.compute_accuracy(batch))
        return loss

    def validation_step(self, batch, batch_idx):
        val_loss = self.compute_loss(batch)
        val_acc = self.compute_accuracy(batch)
        # Validation metrics are also captured
        self.log('val/loss', val_loss)
        self.log('val/accuracy', val_acc)
# Train your model - metrics automatically tracked
model = MyModel()
trainer = Trainer(max_epochs=10)
trainer.fit(model, train_dataloader, val_dataloader)
# Stop tracking
tracelet.stop_logging()
Advanced Configuration¶
# Configure Lightning-specific tracking
tracelet.start_logging(
    exp_name="advanced_lightning",
    backend=["mlflow", "wandb"],   # Multi-backend logging
    config={
        "track_lightning": True,   # Enable Lightning integration
        "track_system": True,      # Monitor system resources
        "track_git": True,         # Track git information
        "metrics_interval": 10.0,  # System metrics every 10 seconds
    }
)
Best Practices¶
Metric Naming¶
Use consistent, hierarchical naming:
def training_step(self, batch, batch_idx):
    # Good: hierarchical naming
    self.log('train/loss', loss)
    self.log('train/accuracy', accuracy)
    self.log('train/f1_score', f1)

    # Also good: phase-specific metrics
    self.log('metrics/train_loss', loss)
    self.log('metrics/train_acc', accuracy)
Logging Frequency¶
Control when metrics are logged:
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)

    # Log every step
    self.log('train/loss', loss, on_step=True, on_epoch=False)

    # Log epoch averages
    self.log('train/epoch_loss', loss, on_step=False, on_epoch=True)

    # Log both
    self.log('train/loss_detailed', loss, on_step=True, on_epoch=True)
Custom Metrics¶
Log custom metrics and hyperparameters:
def on_train_start(self):
    # Log hyperparameters
    self.logger.log_hyperparams({
        'learning_rate': self.learning_rate,
        'batch_size': self.batch_size,
        'model_name': self.__class__.__name__
    })

def training_step(self, batch, batch_idx):
    # Custom metrics
    predictions = self.forward(batch)
    custom_metric = self.compute_custom_metric(predictions, batch)
    self.log('custom/my_metric', custom_metric)
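As an alternative to calling log_hyperparams manually, Lightning's save_hyperparameters() records the arguments passed to __init__ as hyperparameters. A minimal sketch (the constructor arguments here are illustrative):

class MyModel(pl.LightningModule):
    def __init__(self, learning_rate=1e-3, hidden_dim=128):
        super().__init__()
        # Stores learning_rate and hidden_dim in self.hparams so loggers can pick them up
        self.save_hyperparameters()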
Multi-GPU Support¶
Tracelet works seamlessly with Lightning's distributed training:
# Works with DDP and other distributed strategies
trainer = Trainer(
    accelerator='gpu',
    devices=4,
    strategy='ddp'
)
# Metrics from all processes are automatically aggregated
trainer.fit(model)
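Independently of Tracelet, Lightning's self.log() accepts sync_dist=True if you want a metric averaged across processes before it is recorded; a minimal sketch:

def validation_step(self, batch, batch_idx):
    val_loss = self.compute_loss(batch)
    # sync_dist=True reduces the value across all DDP processes before logging
    self.log('val/loss', val_loss, sync_dist=True)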
Integration with Callbacks¶
Use with Lightning callbacks:
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
tracelet.start_logging("lightning_with_callbacks", backend="clearml")
trainer = Trainer(
    callbacks=[
        ModelCheckpoint(monitor='val/loss'),
        EarlyStopping(monitor='val/loss', patience=3)
    ],
    max_epochs=100
)
trainer.fit(model)
Troubleshooting¶
Common Issues¶
Metrics not appearing: Ensure tracelet.start_logging() is called before the Trainer is created.
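For example, this ordering ensures the integration is active when the Trainer is built (names are illustrative):

tracelet.start_logging(exp_name="my_experiment", backend="mlflow")  # start tracking first
trainer = Trainer(max_epochs=10)                                    # then create the Trainer
trainer.fit(model, train_dataloader, val_dataloader)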
Duplicate metrics: If using multiple loggers, you may see duplicate entries. Use Tracelet as the primary logger.
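If you do not need Lightning's default TensorBoard logger, one option (assuming Tracelet's hook does not depend on it) is to disable it so Tracelet is the only sink:

# Rely on Tracelet alone for metric tracking; disable Lightning's default logger
trainer = Trainer(max_epochs=10, logger=False)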
Memory issues with large models: Enable gradient checkpointing and reduce logging frequency for memory-intensive operations.
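One way to reduce logging overhead is to log expensive metrics only at epoch boundaries and raise the Trainer's logging interval; a minimal sketch (compute_expensive_metric is a hypothetical helper):

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)
    # Aggregate the expensive metric to one value per epoch instead of logging every step
    self.log('train/expensive_metric', self.compute_expensive_metric(batch),
             on_step=False, on_epoch=True)
    return loss

trainer = Trainer(max_epochs=10, log_every_n_steps=50)  # write step-level logs less frequently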