Experiment tracking and observability

Important

AI Runtime for single-node tasks is in Public Preview. The distributed training API for multi-GPU workloads remains in Beta.

This page describes how to use MLflow, view logs, manage model checkpoints, and monitor GPU resources on AI Runtime.

MLflow integration

AI Runtime integrates natively with MLflow for experiment tracking, model logging, and metric visualization.

Setup recommendations:

  • Upgrade MLflow to version 3.7 or newer and follow the deep learning workflow patterns.

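    For example, a minimal sketch of upgrading in a notebook cell (assumes a Databricks notebook environment where %pip and dbutils are available):

    %pip install -U "mlflow>=3.7"
    dbutils.library.restartPython()
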
  • Enable autologging for PyTorch Lightning:

    import mlflow
    mlflow.pytorch.autolog()
    
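    With autologging enabled, a standard PyTorch Lightning training run logs metrics, parameters, and the model automatically. A minimal sketch, assuming model and train_loader are defined elsewhere:

    import lightning as L

    mlflow.pytorch.autolog()
    trainer = L.Trainer(max_epochs=3)
    trainer.fit(model, train_loader)  # the MLflow run is created and populated automatically
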
  • Customize your MLflow run name by wrapping your model training code in the mlflow.start_run() API scope. This gives you control over the run name and lets you restart from a previous run. Set the name with the run_name parameter, as in mlflow.start_run(run_name="your-custom-name"), or through third-party libraries that support MLflow (for example, Hugging Face Transformers). Otherwise, the default run name is jobTaskRun-xxxxx.

    from transformers import TrainingArguments
    args = TrainingArguments(
        report_to="mlflow",
        run_name="llama7b-sft-lr3e5",  # <-- MLflow run name
        logging_steps=50,
    )
    
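    Alternatively, set the run name directly with mlflow.start_run(). A minimal sketch; the training code inside the with block is assumed:

    import mlflow

    with mlflow.start_run(run_name="llama7b-sft-lr3e5"):
        mlflow.log_param("learning_rate", 3e-5)
        # ... training code that calls mlflow.log_metric() ...
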
  • The Serverless GPU API automatically creates an MLflow experiment with the default name /Users/{WORKSPACE_USER}/{get_notebook_name()}. You can override it with the MLFLOW_EXPERIMENT_NAME environment variable. Always use absolute paths for MLFLOW_EXPERIMENT_NAME:

    import os
    os.environ["MLFLOW_EXPERIMENT_NAME"] = "/Users/<username>/my-experiment"
    
  • Resume a previous training run by passing its run ID to mlflow.start_run() (or by setting the MLFLOW_RUN_ID environment variable):

    mlflow.start_run(run_id="<previous-run-id>")
    
  • Set the step parameter in MLFlowLogger to a reasonable batch interval instead of logging every batch. MLflow has a limit of 10 million metric steps, and logging every single batch on large training runs can hit this limit. See Resource limits.
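
    For example, a minimal sketch that logs the training loss every 100 batches with plain MLflow (the loader, training_step helper, and interval are illustrative):

    import mlflow

    LOG_EVERY_N_BATCHES = 100
    for batch_idx, batch in enumerate(train_loader):
        loss = training_step(batch)  # assumed helper that runs one batch and returns the loss
        if batch_idx % LOG_EVERY_N_BATCHES == 0:
            mlflow.log_metric("train_loss", float(loss), step=batch_idx)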

Viewing logs

  • Notebook output — Standard output and errors from your training code appear in the notebook cell output.
  • MLflow logs — The MLflow experiment UI displays training metrics, parameters, and artifacts.

Model checkpointing

Save model checkpoints to Unity Catalog volumes, which provide the same governance as other Unity Catalog objects. Use the following path format to reference files in volumes from a Databricks notebook:

/Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>

Save checkpoints to volumes the same way you save them to local storage.

The following example shows how to write a PyTorch checkpoint to Unity Catalog volumes:

import torch

checkpoint = {
    "epoch": epoch,  # last finished epoch
    "model_state_dict": model.state_dict(),  # weights & buffers
    "optimizer_state_dict": optimizer.state_dict(),  # optimizer state
    "loss": loss,  # optional current loss
    "metrics": {"val_acc": val_acc},  # optional metrics
    # Add scheduler state, RNG state, and other metadata as needed.
}
checkpoint_path = "/Volumes/my_catalog/my_schema/model/checkpoints/ckpt-0001.pt"
torch.save(checkpoint, checkpoint_path)

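To resume training later, load the checkpoint back from the same volume path. A minimal sketch, assuming model and optimizer are already constructed:

import torch

checkpoint = torch.load("/Volumes/my_catalog/my_schema/model/checkpoints/ckpt-0001.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # continue from the next epoch
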
This approach also works for distributed checkpoints. The following example shows distributed model checkpointing with the Torch Distributed Checkpoint API:

import torch.distributed.checkpoint as dcp

# Method on a custom trainer class; model, optimizer, and get_state_dict()
# are assumed to be defined on the trainer.
def save_checkpoint(self, checkpoint_path):
    # Gather the (possibly sharded) model and optimizer state on this rank.
    state_dict = self.get_state_dict(self.model, self.optimizer)
    # Write the distributed checkpoint to the Unity Catalog volume.
    dcp.save(state_dict, checkpoint_id=checkpoint_path)

trainer.save_checkpoint("/Volumes/my_catalog/my_schema/model/checkpoints")
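
Restoring follows the same pattern: rebuild the state dict on each rank, then load the saved shards back into it. A minimal sketch on the same assumed trainer; depending on how get_state_dict is implemented, you may also need to push the loaded state back into the model and optimizer:

def load_checkpoint(self, checkpoint_path):
    # Rebuild the state dict on this rank, then restore the saved shards into it in place.
    state_dict = self.get_state_dict(self.model, self.optimizer)
    dcp.load(state_dict, checkpoint_id=checkpoint_path)

trainer.load_checkpoint("/Volumes/my_catalog/my_schema/model/checkpoints")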

Monitor GPU resources

Use the GPU resources pane to monitor GPU health and utilization while your code runs on AI Runtime. The pane supports both single-node and multi-node workloads.

To open the pane, connect your notebook to AI Runtime, then click GPU resources (the chip icon) in the right side pane.


The pane displays the following metrics for each GPU:

  • GPU utilization percentage
  • GPU memory usage
  • Temperature

The pane polls metrics every 10 seconds and retains up to 2 hours of history. Click Refresh to fetch the latest values immediately. After 5 minutes of inactivity, the pane pauses; reopen it to resume monitoring.
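
The pane is for interactive monitoring. If you also want a GPU memory figure recorded in your MLflow run, a minimal sketch using standard PyTorch and MLflow APIs (the metric name is illustrative):

import mlflow
import torch

if torch.cuda.is_available():
    # Peak GPU memory allocated by this process, in gigabytes.
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    mlflow.log_metric("gpu_peak_mem_gb", peak_gb)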

Multi-user collaboration

  • To ensure all users can access shared code (for example, helper modules or environment YAML files), store these files in /Workspace/Shared instead of user-specific folders like /Workspace/Users/<your_email>/. See the import sketch after this list.
  • For code that is in active development, use Git folders inside user-specific folders (/Workspace/Users/<your_email>/) and push to remote Git repos. This lets each user work in a user-specific clone and branch while still using a remote Git repo for version control. See best practices for using Git on Databricks.
  • Collaborators can share and comment on notebooks.
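
For example, a minimal sketch of importing a shared helper module stored under /Workspace/Shared (the folder and module names are hypothetical):

import sys

# Files under /Workspace are visible on the notebook's filesystem.
sys.path.append("/Workspace/Shared/ml_utils")

import data_prep  # hypothetical module at /Workspace/Shared/ml_utils/data_prep.py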

Global limits in Azure Databricks

See Resource limits.