Troubleshooting¶

This guide helps you diagnose and resolve common issues when using Tracelet.

Installation Issues¶

ImportError: No module named 'tracelet'¶

Problem: Tracelet is not installed or not in the Python path.

Solution:

# Install Tracelet
pip install tracelet

# Or with specific backends (quote the extras so shells such as zsh do not expand the brackets)
pip install "tracelet[mlflow,wandb]"

# For development
pip install -e ".[dev]"

Backend Import Errors¶

Problem: An error such as ImportError: MLflow is not installed appears, or the equivalent message for another backend.

Solution: Install the specific backend extras:

pip install "tracelet[mlflow]"     # For MLflow
pip install "tracelet[clearml]"    # For ClearML
pip install "tracelet[aim]"        # For Aim
pip install "tracelet[all]"        # For all backends
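
Before reinstalling, it can help to confirm which packages the current interpreter can actually see. The following is a minimal sketch: the helper name `find_missing` is hypothetical, and the module names are assumed to match the import names of the backends listed above.

```python
from importlib.util import find_spec

# Import names of the optional backends named in this guide (assumed).
BACKEND_MODULES = ["mlflow", "wandb", "clearml", "aim"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(["tracelet", *BACKEND_MODULES])
    print("Missing:", ", ".join(missing) if missing else "none")
```

Run it with the same interpreter that runs your training script. A package reported as missing here but present in `pip list` usually points to a different virtual environment.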

Connection Issues¶

MLflow Server Connection Failed¶

Problem: Cannot connect to MLflow tracking server.

Diagnosis:

import mlflow
print(f"MLflow tracking URI: {mlflow.get_tracking_uri()}")

Solutions:

# Start local MLflow server
mlflow server --host 127.0.0.1 --port 5000

# Or set tracking URI
export MLFLOW_TRACKING_URI=http://localhost:5000

# In code
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
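
To check whether anything is listening at the tracking URI, a plain HTTP probe is often enough. This sketch assumes the MLflow tracking server answers on a `/health` path; the function name `tracking_server_reachable` is hypothetical and not part of MLflow or Tracelet.

```python
import urllib.error
import urllib.request


def tracking_server_reachable(base_url, timeout=3.0):
    """Return True if an HTTP GET to <base_url>/health succeeds with status 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error, or malformed URL.
        return False


if __name__ == "__main__":
    print(tracking_server_reachable("http://localhost:5000"))
```

A `False` result with a correct URI usually means the server is not running or a firewall is blocking the port.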

W&B Authentication Issues¶

Problem: wandb.errors.UsageError: api_key not configured

Solution:

# Login to W&B
wandb login

# Or set API key
export WANDB_API_KEY=your_api_key_here
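
To see whether a key is visible to the current process, you can check both places the steps above write to. This is a sketch under assumptions: it assumes `wandb login` stores the key in `~/.netrc` under the host `api.wandb.ai` (the file name differs on Windows), and `wandb_key_available` is a hypothetical helper.

```python
import netrc
import os
from pathlib import Path


def wandb_key_available(env=None, netrc_path=None):
    """Return True if a W&B API key is visible via WANDB_API_KEY or a netrc entry."""
    env = os.environ if env is None else env
    if env.get("WANDB_API_KEY"):
        return True
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    try:
        entry = netrc.netrc(str(path)).authenticators("api.wandb.ai")
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable, or malformed netrc file.
        return False
    return entry is not None


if __name__ == "__main__":
    print("W&B key found:", wandb_key_available())
```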

ClearML Offline Mode¶

Problem: ClearML tries to contact its server and fails when no internet connection is available.

Solution (for testing/CI):

import os
os.environ["CLEARML_WEB_HOST"] = ""
os.environ["CLEARML_API_HOST"] = ""
os.environ["CLEARML_FILES_HOST"] = ""
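
ClearML also has its own offline mode, enabled with the `CLEARML_OFFLINE_MODE` environment variable or `Task.set_offline(True)`. The sketch below sets the variable; whether Tracelet's ClearML backend honors it is an assumption, and `enable_clearml_offline` is a hypothetical helper.

```python
import os


def enable_clearml_offline(env=None):
    """Request ClearML offline mode; call this before any ClearML Task is created."""
    env = os.environ if env is None else env
    env["CLEARML_OFFLINE_MODE"] = "1"
    return env


if __name__ == "__main__":
    enable_clearml_offline()
    print(os.environ["CLEARML_OFFLINE_MODE"])
```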

Metric Logging Issues¶

Metrics Not Appearing in Backend¶

Problem: TensorBoard metrics logged but not showing in MLflow/W&B.

Diagnosis:

# Check if Tracelet is active
import tracelet
print(f"Active experiment: {tracelet.get_active_experiment()}")
print(f"Active backends: {tracelet.get_active_backends()}")

Solutions:

Ensure tracelet.start_logging() is called before the SummaryWriter is created.
Check the backend configuration:

tracelet.start_logging(
    exp_name="test",
    backend="mlflow",
    config={"track_tensorboard": True}  # Ensure this is True
)

Duplicate Metrics¶

Problem: Same metrics appearing multiple times.

Cause: Multiple experiment tracking tools running simultaneously.

Solution: Use only Tracelet for experiment tracking:

# Don't use multiple loggers simultaneously
# writer = SummaryWriter()  # Tracelet handles this
# mlflow.log_metric()       # Avoid direct backend calls
# wandb.log()               # Let Tracelet route metrics

Missing Lightning Metrics¶

Problem: PyTorch Lightning metrics not captured.

Solutions:

Start Tracelet before creating Trainer:

tracelet.start_logging("experiment", backend="mlflow")
trainer = pl.Trainer()  # Create after start_logging

Enable Lightning tracking:

tracelet.start_logging(
    "experiment",
    backend="mlflow",
    config={"track_lightning": True}
)

Performance Issues¶

High Memory Usage¶

Problem: Tracelet consuming too much memory.

Solutions:

# Reduce system monitoring frequency
tracelet.start_logging(
    "experiment",
    backend="mlflow",
    config={
        "track_system": False,          # Disable system monitoring
        "metrics_interval": 60.0,       # Reduce frequency
        "max_image_size": "512KB",      # Limit image sizes
    }
)

# In training loop - log histograms less frequently
if step % 100 == 0:
    for name, param in model.named_parameters():
        writer.add_histogram(name, param, step)

Slow Training¶

Problem: Training significantly slower with Tracelet.

Diagnosis:

import time

# Time one training step with Tracelet enabled, then again with it disabled
start = time.perf_counter()
# Your training step
elapsed = time.perf_counter() - start
print(f"Training step took {elapsed:.3f}s")
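
A single timed step is noisy, so averaging over several steps gives a more reliable comparison. The helpers below are a minimal sketch; `mean_step_time` and `overhead_ratio` are hypothetical names, and `step_fn` stands in for your own training step.

```python
import time


def mean_step_time(step_fn, n_steps=20):
    """Average wall-clock seconds per call of step_fn over n_steps calls."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps


def overhead_ratio(baseline_seconds, tracked_seconds):
    """Relative slowdown: 0.25 means tracked steps are 25% slower than baseline."""
    return (tracked_seconds - baseline_seconds) / baseline_seconds
```

Measure once without Tracelet to get the baseline, once with it, and pass both averages to `overhead_ratio`. On GPU workloads, synchronize the device inside `step_fn` so asynchronous kernels are included in the timing.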

Solutions:

Reduce logging frequency:

# Log metrics less frequently
if step % 10 == 0:  # Instead of every step
    writer.add_scalar('loss', loss, step)

Disable expensive operations:

config = {
    "track_system": False,      # Disable system monitoring
    "track_git": False,         # Disable git tracking
    "capture_histograms": False # Disable histogram capture
}
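
The reduced logging frequency shown above can be centralized so each kind of data has its own cadence. This is a small sketch; `LOG_EVERY` and `should_log` are hypothetical names, and the intervals are illustrative rather than recommended values.

```python
# Hypothetical cadences: log scalars every 10 steps, histograms every 100.
LOG_EVERY = {"scalar": 10, "histogram": 100}


def should_log(kind, step, cadence=LOG_EVERY):
    """Return True when `step` is a multiple of the interval configured for `kind`."""
    return step % cadence[kind] == 0
```

In the training loop, guard each call, for example `if should_log("scalar", step): writer.add_scalar("loss", loss, step)`.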

Network Timeouts¶

Problem: Timeouts when logging to cloud backends.

Solutions:

Use local backend as fallback:

try:
    tracelet.start_logging("exp", backend="wandb")
except Exception:
    print("W&B failed, falling back to MLflow")
    tracelet.start_logging("exp", backend="mlflow")

Configure timeout settings:

import os
# Set before start_logging(); wandb reads these timeouts (in seconds) at init
os.environ["WANDB_INIT_TIMEOUT"] = "60"
os.environ["WANDB_HTTP_TIMEOUT"] = "60"
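
The try/except fallback above generalizes to an ordered list of backends. The sketch below takes the start call as a parameter so it makes no assumptions about Tracelet beyond the `start_logging` usage already shown; `start_with_fallback` is a hypothetical helper.

```python
def start_with_fallback(start, backends):
    """Try each backend in order; return the first name for which start() succeeds."""
    errors = {}
    for name in backends:
        try:
            start(name)
            return name
        except Exception as exc:  # broad on purpose: backends raise different types
            errors[name] = exc
    raise RuntimeError(f"All backends failed: {errors}")
```

A call might look like `start_with_fallback(lambda b: tracelet.start_logging("exp", backend=b), ["wandb", "mlflow"])`. The return value tells you which backend ended up active.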

Platform-Specific Issues¶

Windows Path Issues¶

Problem: File path errors on Windows.

Solution:

from pathlib import Path
from torch.utils.tensorboard import SummaryWriter

# Use pathlib for cross-platform paths
log_dir = Path("./runs") / "experiment_1"
writer = SummaryWriter(log_dir=str(log_dir))

M1 Mac Compatibility¶

Problem: Some backends not working on Apple Silicon.

Solutions:

Install x86_64 version:

arch -x86_64 pip install "tracelet[all]"

Use conda-forge:

conda install -c conda-forge tracelet

Docker Container Issues¶

Problem: Backends not accessible from Docker.

Solution:

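# In Dockerfile
ENV MLFLOW_TRACKING_URI=http://host.docker.internal:5000
# Pass secrets at runtime instead of baking them into the image:
#   docker run -e WANDB_API_KEY=your_api_key ...
# On Linux, host.docker.internal also needs:
#   docker run --add-host=host.docker.internal:host-gateway ...

# Only for an image that runs the MLflow server itself
EXPOSE 5000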

Debugging Tips¶

Enable Debug Logging¶

import logging
logging.basicConfig(level=logging.DEBUG)

# Or Tracelet-specific logging
tracelet_logger = logging.getLogger('tracelet')
tracelet_logger.setLevel(logging.DEBUG)

Check Configuration¶

# Print current configuration
experiment = tracelet.get_active_experiment()
print(f"Experiment: {experiment}")
print(f"Config: {experiment.config if experiment else 'No active experiment'}")

Test Backend Connectivity¶

def test_backend(backend_name):
    """Test if backend is accessible"""
    try:
        tracelet.start_logging(f"test_{backend_name}", backend=backend_name)
        experiment = tracelet.get_active_experiment()
        experiment.log_metric("test_metric", 1.0, 0)
        print(f"✅ {backend_name} working")
        tracelet.stop_logging()
        return True
    except Exception as e:
        print(f"❌ {backend_name} failed: {e}")
        return False

# Test all backends
for backend in ["mlflow", "wandb", "clearml", "aim"]:
    test_backend(backend)

Getting Help¶

Check System Information¶

import tracelet
print(tracelet.__version__)

import sys
print(f"Python: {sys.version}")

import torch
print(f"PyTorch: {torch.__version__}")
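
To gather the same information without importing each package (which can itself fail), you can read versions from the installed package metadata. This is a sketch; `collect_versions` is a hypothetical helper and the distribution names are assumed to match the PyPI names of the packages mentioned in this guide.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("tracelet", "torch", "mlflow", "wandb", "clearml", "aim")):
    """Map each distribution name to its installed version, or 'not installed'."""
    info = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Paste the printed output into your issue report along with the minimal reproduction.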

Minimal Reproduction¶

When reporting issues, provide a minimal example:

import tracelet
from torch.utils.tensorboard import SummaryWriter

# Minimal failing example
tracelet.start_logging("test_experiment", backend="mlflow")
writer = SummaryWriter()
writer.add_scalar("test", 1.0, 0)
writer.close()
tracelet.stop_logging()

Common Error Messages¶

| Error Message | Likely Cause | Solution |
| --- | --- | --- |
| `ModuleNotFoundError: No module named 'mlflow'` | Backend not installed | `pip install "tracelet[mlflow]"` |
| `ConnectionError: HTTPConnectionPool` | Backend server not running | Start the MLflow server |
| `wandb.errors.UsageError: api_key not configured` | W&B not authenticated | `wandb login` |
| `AttributeError: 'NoneType' object has no attribute 'log_metric'` | No active experiment | Call `start_logging()` first |
| `RuntimeError: CUDA out of memory` | GPU memory exhausted | Reduce batch size or disable system tracking |

For additional help, please:

Check our GitHub Issues
Join our Discussions
Email us at support@tracelet.io