MDP agents freeze mid-job with complete loss of output — random freeze point, empty diagnostic logs, both pools affected

Question

MDP agents freeze mid-job with complete loss of output — random freeze point, empty diagnostic logs, both pools affected

Pall Bjornsson 20

Tags: azure-devops, managed-devops-pools, azure-pipelines, vmss

Since approximately 2026-03-27, our Managed DevOps Pool agents have been randomly freezing mid-job. The agent stops producing log output entirely, ADO receives no further heartbeat, and the job stays in "Running" state until the pipeline timeout kills it. This affects both our MDP pools independently.

ENVIRONMENT

JDNextBuild-Set (primary concern):

VM SKU: Standard_D4s_v3
OS image: Ubuntu 24.04 (custom Azure Image Builder image)
Agent version: 4.270.0 (recently also 4.271.0)
Region: East US
Network: VNet-injected, dedicated subnet
Max concurrency: 15

Terraform-Set:

VM SKU: Standard_D2s_v3
OS image: Ubuntu 24.04 (custom Azure Image Builder image)
Agent version: 4.270.0 (recently also 4.271.0)
Region: East US
Network: VNet-injected, dedicated subnet
Max concurrency: 10

Both pools are on separate subnets, separate VM SKUs, and completely different software stacks.

SYMPTOMS

Random freeze point: the agent can freeze at any step. On JDNextBuild-Set we have seen freezes during Docker builds, helm upgrade, kubectl operations, and PowerShell steps expected to complete in under 2 seconds. There is no consistent step or pattern.
No error output: the last log line is a normal, successful operation. No exception, exit code, or warning precedes the freeze.
Agent heartbeat stops: ADO receives no further communication. The job sits in "Running" until the timeout fires.
Agent diagnostic logs folder is empty: downloading the job logs and opening the "Agent Diagnostic Logs" folder shows no files whatsoever. This indicates the issue occurs before the agent completes its lifecycle normally.
MDPResourceLog shows no errors: checked in Log Analytics, no errors or warnings appear for either pool during affected periods.
MDP provisioning metrics are clean: no CustomScriptError or provisioning failures in the Metrics blade.

EXAMPLE FROM RUN LOGS

From a Terraform-Set job:

2026-03-29T02:00:06Z Starting: Terraform Plan - bootstrap

...normal output...

[complete silence for 49 minutes 57 seconds]

2026-03-29T02:50:03Z Finishing: Finalize Job

2026-03-29T02:50:03Z The job running on agent Terraform-Set 4 stopped responding...

From a JDNextBuild-Set job:

2026-03-28T17:10:11Z Agent version: 4.270.0

...normal job output...

[freeze - no output, no error]

Agent Diagnostic Logs folder: empty (0 files)

WHAT WE HAVE RULED OUT

Agent version change: 4.270.0 has been in use for months with no issue prior to 2026-03-27. No version change coincides with onset.
Image-specific cause: both pools have completely different software (JDNextBuild: Docker, Node.js, .NET, helm, kubectl; Terraform: Terraform, PowerShell Az modules). Same symptom on both rules out anything image-related.
Specific task: freezes occur at random points including trivial 2-second tasks. No single step is consistently involved.
Burstable SKU throttling: JDNextBuild-Set is on non-burstable D4s_v3. Terraform-Set was moved from B2ms to D2s_v3; the issue predates and persists after this change.
Custom DNS / networking: Azure-provided DNS, standard VNet with NSG. No custom DNS or unusual routing.
No changes on our end: no image updates, pipeline changes, or infrastructure changes coincide with issue onset.

OBSERVATIONS

We have seen MDP pool report the agent as ready even if the pipeline it was running and is supposed to be running is not yet timed out. And we have seen MDP deprovision that agent while pipeline still reported running.

We have also seen MDP unable to provision more agents even if both concurrent self hosted jobs as well as max pool capacity allow. In that state, there's an MDP agent in the pool as Ready, but doesn't get allocated a pending job, until the failing one times out.

SIMILAR REPORT FOUND

A very similar problem was reported for MS-hosted agents on 2026-03-24:

https://learn.microsoft.com/en-us/answers/questions/5835366/ms-hosted-build-agent-acquisition-is-stuck-(outage

Key similarities:

Agent stuck with no meaningful error
Agent Diagnostic Logs folder is empty - identical to our MDP case
Azure status page showed all green throughout
Self-resolved after ~24 hours with no action taken
Accepted answer: "none of the advice here helped, it just randomly started working again after 24h"

Our issue differs in that it has persisted for multiple days across two independent pools rather than resolving on its own and is getting gradually worse by the day.

RELATED GITHUB ISSUES

#5491 - Agent does not stop / timeout not enforced when agent freezes https://github.com/microsoft/azure-pipelines-agent/issues/5491 Confirms that timeoutInMinutes relies on active agent heartbeat; frozen agents bypass timeout enforcement entirely.
#5290 - --once mode silently drops jobs on scheduling API error https://github.com/microsoft/azure-pipelines-agent/issues/5290 Relevant as MDP uses --once for ephemeral agent lifecycle.

WORKAROUNDS APPLIED (do not resolve root cause)

Reduced pipeline job timeouts (plan: 75 min, apply: 120 min) to free up concurrency slots faster after a freeze
Increased MDP pool concurrency to provide more headroom while frozen agents hold slots

QUESTIONS

Has anyone else experienced this pattern with MDP agents specifically (not MS-hosted)?
Is there any way to get better diagnostics from inside the MDP VM when a freeze occurs? We cannot access MDP VMs directly as they run in Microsoft's managed subscription.
Is there a known platform issue with MDP in East US around this timeframe and ongoing?

Any insight appreciated

Pall Bjornsson 20 Reputation points

2026-04-01T17:24:25.8833333+00:00

Just want to mention that we switched to MDP agents approximately 3 weeks ago. They have been working and running perfectly, until that last Friday.

Yesterday it was so bad that I had to do 5 attempts until I got one of my pipelines to succeed.

Today, it's similar. I have to watch every pipeline run and retry the failing ones. 1-3 retries usually succeed, but not always.

Palli
Pravallika KV 14,235 Reputation points Microsoft External Staff Moderator

2026-04-02T22:56:55.3633333+00:00
@Pall Bjornsson ,

Here are some diagnostics and next steps you can try, plus pointers to the MDP docs.

Validate your network connectivity

Run the “Validate MDP Pool Connectivity” PowerShell script on a VM in the same subnet as your agents (see Configure Managed DevOps Pools networking → Validate endpoint connectivity). Even intermittent drops or DNS failures can make the agent process hang without logging. If any endpoints fail, work with your network team to open them.

Turn on and ship MDP diagnostic logs

In the Azure portal, go to your MDP pool → Monitor → Diagnostic settings.

Create a diagnostic setting to send logs to a Log Analytics workspace or storage account (see Troubleshoot Managed DevOps Pools issues → Diagnostic logs).

After a freeze, query the Log Analytics table for any AgentProvisioning or AgentLifecycle records around that time.

Enable interactive mode on the pool

In the pool’s Security tab, turn on EnableInteractiveMode (see How to enable interactive mode for agents in MDP).

When an agent goes into that stuck state, you can SSH into the VM to inspect /var/az_agent, kernel logs, network stats, etc.

Turn up agent and task logging

In your pipeline YAML, set variables:

system.debug: true AGENT_TRACE: 1

Or export VSTS_AGENT_TRACE=1 on your custom image.

This gives you full HTTP tracing and internal agent logs in your job workspace (see Review logs to diagnose pipeline issues).

Check Azure DevOps Service Status

Go to status.dev.azure.com and filter for “Managed DevOps Pools” or “Azure Pipelines” around your freeze timestamps. There was a transient MS-hosted glitch on 3/24, but nothing since then is publicly noted.
Pravallika KV 14,235 Reputation points Microsoft External Staff Moderator

2026-05-05T08:55:26.6033333+00:00

Hi @Pall Bjornsson , following up to see if you had a chance to check the analysis shared over private message and if it was helpful. Please let us know if you have any questions.

Answer accepted by question author

2 additional answers

Your answer

Pall Bjornsson 20 Reputation points

2026-04-01T17:24:25.8833333+00:00

Just want to mention that we switched to MDP agents approximately 3 weeks ago. They have been working and running perfectly, until that last Friday.

Yesterday it was so bad that I had to do 5 attempts until I got one of my pipelines to succeed.

Today, it's similar. I have to watch every pipeline run and retry the failing ones. 1-3 retries usually succeed, but not always.

Palli
Pravallika KV 14,235 Reputation points Microsoft External Staff Moderator

2026-04-02T22:56:55.3633333+00:00

@Pall Bjornsson ,

Here are some diagnostics and next steps you can try, plus pointers to the MDP docs.

Validate your network connectivity

Run the “Validate MDP Pool Connectivity” PowerShell script on a VM in the same subnet as your agents (see Configure Managed DevOps Pools networking → Validate endpoint connectivity). Even intermittent drops or DNS failures can make the agent process hang without logging. If any endpoints fail, work with your network team to open them.

Turn on and ship MDP diagnostic logs

In the Azure portal, go to your MDP pool → Monitor → Diagnostic settings.

Create a diagnostic setting to send logs to a Log Analytics workspace or storage account (see Troubleshoot Managed DevOps Pools issues → Diagnostic logs).

After a freeze, query the Log Analytics table for any AgentProvisioning or AgentLifecycle records around that time.

Enable interactive mode on the pool

In the pool’s Security tab, turn on EnableInteractiveMode (see How to enable interactive mode for agents in MDP).

When an agent goes into that stuck state, you can SSH into the VM to inspect /var/az_agent, kernel logs, network stats, etc.

Turn up agent and task logging

In your pipeline YAML, set variables:

system.debug: true AGENT_TRACE: 1

Or export VSTS_AGENT_TRACE=1 on your custom image.

This gives you full HTTP tracing and internal agent logs in your job workspace (see Review logs to diagnose pipeline issues).

Check Azure DevOps Service Status

Go to status.dev.azure.com and filter for “Managed DevOps Pools” or “Azure Pipelines” around your freeze timestamps. There was a transient MS-hosted glitch on 3/24, but nothing since then is publicly noted.
Pravallika KV 14,235 Reputation points Microsoft External Staff Moderator

2026-05-05T08:55:26.6033333+00:00

Hi @Pall Bjornsson , following up to see if you had a chance to check the analysis shared over private message and if it was helpful. Please let us know if you have any questions.

Answer 1

Hi @Pall Bjornsson ,

Thanks for reaching out to Microsoft Q&A.

Below is the update received from the backend team;

There was an active platform level incident reported: Agents are freezing and not finishing builds in Azure DevOps using MDP pool which lasted for 5 Days and 22 hours where multiple customer subscriptions were affected:

There was a service-side issue in Managed DevOps Pools where agents could stop mid-job with no further logs or heartbeat, so the pipeline would just sit in a running state until it timed out. That was identified internally as a regression and later fixed with a hotfix.

Looking at what you described (jobs hanging mid-execution, no logs, and then everything going back to normal without any changes), this doesn’t point to anything in your pipeline or setup. It lines up with that kind of platform behavior.

Based on the symptoms and how it resolved, I’d lean toward this being service-side rather than anything on your end.

I know how disruptive something like this can be, especially when there’s no signal in the logs to work with. If everything has been stable since then, that’s a good sign it was tied to that fix and shouldn’t continue to impact you.

Hope this helps!

If the resolution was helpful, kindly take a moment to click on User's image and click on Yes for was this answer helpful. And, if you have any further query do let us know.

Answer 2

@Pravallika KV ,

Today has been the first day since the failure started to occur 7 days ago, where I have seen no freezing. I have been running a 5 second interval polling script on the jumpbox inside the agent VNET for 10 hours. The last 3 hours of those 10, I tried to keep all agents busy for both agent types, continuously running the same pipelines that have been failing on us the past 7 days. No failure today and the polling indicates a very good network stability. Occasional transient single isolated failures that don't repeat on next poll and the failures are spread all over the polled resources, so no pattern there.

I feel like these are both good and bad news. Good of course to have no failures. Bad because failures have been ongoing for 7 full days and if they magically stop without any strategic actions on the backend, they are likely to surface again.

For the time being, I will keep the jumpbox poller running in case the failures start again. If they do, we can at least correlate the failures with the poller to see if network issues were noticed at the same time.

In case you are aware of any backend changes that may have contributed to a fix, please let me know.

I will keep monitoring until middle of next week at least.

Pravallika KV 14,235 Reputation points Microsoft External Staff Moderator

2026-04-07T02:35:50.9733333+00:00

Thanks for the update @Pall Bjornsson . I have reached out to you over private message. Please check and provide requested information.

Answer 3

@Pravallika KV , thank hank you for the suggestions. Here is where we stand after investigating each point:

Diagnostic settings (point 2) Already fully configured on both pools — allLogs enabled, sending to Log Analytics, managed in Terraform. However, after querying the workspace, only the MDPResourceLog table exists. There are no AgentProvisioning or AgentLifecycle tables. All entries in MDPResourceLog (Provision, Return, Reimage) show Status: Completed with no errors. The freeze produces zero trace in any logging surface available to us.

Interactive mode (point 3) We plan to enable this on our Terraform pool specifically. We have made a useful observation about the agent state during a freeze: the agent appears to remain in Allocated state for a significant period before transitioning to Ready, and does not get reassigned a new job until after the pipeline timeout fires. This suggests the VM stays alive during the freeze and we likely have a viable diagnostic window to SSH in. We are preparing to enable interactive mode with an extended grace period to catch the next freeze.

Debug logging (point 4) We will enable system.debug: true and AGENT_TRACE=1. We don't expect this to capture the freeze event itself since the agent produces zero output after the freeze point, but it may reveal something in the lead-up.

Network validation (point 1) Our setup uses Azure-provided DNS and a standard NSG with no custom routing. Agents succeed most of the time and the freeze is entirely random, including on steps with no external network calls. A network issue would typically produce a connection error rather than a silent hang, so we consider this low probability. We will however run the connectivity validation script from within the subnet as a baseline.

Additional data point on agent state We have observed that MDP sometimes shows a frozen agent as Ready in the pool before the ADO pipeline has timed out. MDP then deprovisioned that agent while ADO still reported the job as Running. This state divergence between MDP and ADO during the freeze may be a useful signal for the engineering team.

We will report back with SSH diagnostics from the next freeze event. We also have an open support request via our CSP. The issue has been getting progressively worse since 2026-03-27 — yesterday required 1-3 retries per pipeline throughout the day. Any escalation to the MDP engineering team who can access internal pool and VM lifecycle telemetry would be greatly appreciated.

Pravallika KV 14,235 Reputation points Microsoft External Staff Moderator

2026-04-03T17:49:41.1366667+00:00

@Pall Bjornsson ,Thanks for the update, we will check and keep you updated. Please keep us posted with any additional logs or findings from your side that will help accelerate root cause analysis.
Siddhesh Desai 6,555 Reputation points Microsoft External Staff Moderator

2026-04-10T09:31:14.8166667+00:00

Hi @Pall Bjornsson

Can you please share me your email id in the private chat, It is necessary for communication and escalation of your case.
Siddhesh Desai 6,555 Reputation points Microsoft External Staff Moderator

2026-05-05T07:42:33.3833333+00:00

Hi @Pall Bjornsson

I got an update from the backend team; there was an active platform level incident reported: Agents are freezing and not finishing builds in Azure DevOps using MDP pool

which lasted for 5 Days and 22 hours where multiple customer subscriptions were affected:

The backend engineer said:

There was a service-side issue in Managed DevOps Pools where agents could stop mid-job with no further logs or heartbeat, so the pipeline would just sit in a running state until it timed out. That was identified internally as a regression and later fixed with a hotfix.

Looking at what you described (jobs hanging mid-execution, no logs, and then everything going back to normal without any changes), this doesn’t point to anything in your pipeline or setup. It lines up with that kind of platform behavior.

Based on the symptoms and how it resolved, I’d lean toward this being service-side rather than anything on your end.

I know how disruptive something like this can be, especially when there’s no signal in the logs to work with. If everything has been stable since then, that’s a good sign it was tied to that fix and shouldn’t continue to impact you.

Share via

MDP agents freeze mid-job with complete loss of output — random freeze point, empty diagnostic logs, both pools affected

2 additional answers

Your answer