Trace concepts

Tracing is an observability technique that captures the complete execution flow of a request through your application. Unlike traditional logging, which records isolated events, tracing builds a detailed map of how data flows through your system and records every operation along the way.

GenAI applications run complex, multi-step workflows that combine multiple components such as LLMs, retrievers, tools, and agents. Tracing makes those workflows debuggable by capturing the full execution flow.

Trace structure

An MLflow Trace comprises two primary objects:

  1. Trace.info of type TraceInfo: Metadata describing the trace's origin, status, and execution time. TraceInfo also holds tags: key-value pairs, such as user identifiers, session identifiers, and developer-defined labels, that you can use to search and filter traces.

  2. Trace.data of type TraceData: The actual payload containing instrumented Span objects that capture your application's step-by-step execution from input to output.

[Figure: Trace architecture]

MLflow Traces are compatible with the OpenTelemetry specification, a widely adopted industry standard for observability. Traces therefore remain interoperable with other OpenTelemetry-compatible observability tools, while MLflow extends the OpenTelemetry model with GenAI-specific structures and attributes.

TraceInfo

TraceInfo provides lightweight metadata about the overall trace. Key fields include:

Field                 Description
trace_id              Unique identifier for the trace
trace_location        Where the trace is stored (MLflow Experiment or Databricks Inference Table)
request_time          Start time of the trace in milliseconds
state                 Trace status: OK, ERROR, IN_PROGRESS, or STATE_UNSPECIFIED
execution_duration    Duration of the trace in milliseconds
request_preview       JSON-encoded preview of the input (root span input)
response_preview      JSON-encoded preview of the output (root span output)
tags                  Key-value pairs for filtering and searching traces

TraceData

The TraceData object is a container of Span objects where the execution details are stored. Each span captures information about a specific operation, including:

  • Requests and responses
  • Latency measurements
  • LLM messages and tool parameters
  • Retrieved documents and context
  • Metadata and attributes

Spans form a hierarchical structure through parent-child connections, creating a tree that represents your application's execution flow.
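
To make the parent-child structure concrete, the following sketch rebuilds a tree from a flat list of spans. It is a toy model that uses only the Python standard library, not MLflow's own Span class. The ToySpan type, its fields (span_id, parent_id, name), and the sample span names are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical stand-in for a span; not MLflow's Span class."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


def render_tree(spans: list[ToySpan]) -> list[str]:
    """Return one indented line per span, walking the tree depth-first."""
    # Index spans by their parent so each node's children are easy to find.
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children[parent_id]:
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)  # start from the root span (no parent)
    return lines


# A flat list of spans, as it might appear for a small retrieval workflow.
spans = [
    ToySpan("1", None, "answer_question"),
    ToySpan("2", "1", "retrieve_documents"),
    ToySpan("3", "1", "call_llm"),
    ToySpan("4", "3", "call_tool"),
]

tree_lines = render_tree(spans)
print("\n".join(tree_lines))
```

The root span has no parent, and every other span points at the operation that invoked it, so a single pass over the flat list is enough to recover the execution flow.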

[Figure: Span architecture]

Tags

Tags are mutable key-value pairs attached to traces for organization and filtering. MLflow defines standard tags for common use cases:

  • mlflow.trace.session: Session identifier for grouping related traces
  • mlflow.trace.user: User identifier for tracking per-user interactions
  • mlflow.source.name: Entry point or script that generated the trace
  • mlflow.source.git.commit: Git commit hash of the source code (if applicable)
  • mlflow.source.type: Source type (PROJECT, NOTEBOOK, etc.)

You can also add custom tags for your specific needs. Learn more in Add context to traces and Attach custom tags / metadata.
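
As a rough illustration of how tags support grouping and filtering, the sketch below works on plain dictionaries rather than real MLflow traces. The record layout, the helper functions group_by_tag and filter_by_tag, and the custom "release" tag are hypothetical; only the mlflow.trace.session and mlflow.trace.user keys come from the list above.

```python
from collections import defaultdict

# Hypothetical trace records: only the trace ID and its tags are modeled.
traces = [
    {"trace_id": "tr-1", "tags": {"mlflow.trace.session": "s-1", "mlflow.trace.user": "alice"}},
    {"trace_id": "tr-2", "tags": {"mlflow.trace.session": "s-1", "mlflow.trace.user": "alice"}},
    {"trace_id": "tr-3", "tags": {"mlflow.trace.session": "s-2", "mlflow.trace.user": "bob",
                                  "release": "canary"}},  # "release" is a custom tag
]


def group_by_tag(records, key):
    """Map each value of the given tag to the trace IDs that carry it."""
    groups = defaultdict(list)
    for record in records:
        value = record["tags"].get(key)
        if value is not None:
            groups[value].append(record["trace_id"])
    return dict(groups)


def filter_by_tag(records, key, value):
    """Return the IDs of traces whose tag `key` equals `value`."""
    return [r["trace_id"] for r in records if r["tags"].get(key) == value]


by_session = group_by_tag(traces, "mlflow.trace.session")
canary = filter_by_tag(traces, "release", "canary")
print(by_session)
print(canary)
```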

Storage layout

MLflow optimizes trace storage for performance and cost. To customize the storage location, attach a Unity Catalog volume when creating an experiment. Access is then governed by Unity Catalog volume privileges.

TraceInfo is stored directly in a relational database as indexed rows, which enables fast queries for searching and filtering traces.

TraceData (the spans) is stored in artifact storage rather than the relational database because span payloads are much larger than the trace metadata. Keeping them out of the database keeps search queries fast as trace volume grows.
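
The split between indexed metadata and bulky span payloads can be sketched with the standard library alone. This is a conceptual model, not MLflow's actual schema or storage code: the trace_info table, the spans.json key format, the in-memory dictionary standing in for artifact storage, and all function names are assumptions.

```python
import json
import sqlite3

# Small metadata rows go into a relational database with an index on state.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE trace_info (trace_id TEXT PRIMARY KEY, state TEXT, request_time INTEGER)"
)
db.execute("CREATE INDEX idx_trace_state ON trace_info (state)")

# Large span payloads go elsewhere; a dict stands in for artifact storage.
artifact_store: dict[str, str] = {}


def log_trace(trace_id, state, request_time, spans):
    """Write metadata to the database and the span payload to artifact storage."""
    db.execute("INSERT INTO trace_info VALUES (?, ?, ?)", (trace_id, state, request_time))
    artifact_store[f"{trace_id}/spans.json"] = json.dumps(spans)


def search_trace_ids(state):
    """Search touches only the small metadata table, never the span payloads."""
    rows = db.execute(
        "SELECT trace_id FROM trace_info WHERE state = ? ORDER BY request_time", (state,)
    )
    return [row[0] for row in rows]


def load_spans(trace_id):
    """Fetch the full span payload only for a trace you actually want to inspect."""
    return json.loads(artifact_store[f"{trace_id}/spans.json"])


log_trace("tr-1", "OK", 1000, [{"name": "call_llm", "outputs": "x" * 10_000}])
log_trace("tr-2", "ERROR", 2000, [{"name": "call_llm", "outputs": None}])
log_trace("tr-3", "ERROR", 3000, [{"name": "retrieve", "outputs": []}])

failed = search_trace_ids("ERROR")
first_failed_spans = load_spans(failed[0])
print(failed, first_failed_spans[0]["name"])
```

In this model a search never reads a span payload, which is the property the storage layout is designed to preserve: filtering cost depends on the number of metadata rows, not on how large individual traces are.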

Active vs. finished traces

An active trace is one that MLflow is still writing, such as while a function decorated with @mlflow.trace is running. After the decorated function exits, the trace is finished, but you can still annotate it with new data.

To work with active or recent traces, use these methods:

Next steps