
MDP agents freeze mid-job with complete loss of output — random freeze point, empty diagnostic logs, both pools affected

Pall Bjornsson 20 Reputation points
2026-04-01T16:21:37.6666667+00:00

Tags: azure-devops, managed-devops-pools, azure-pipelines, vmss

Since approximately 2026-03-27, our Managed DevOps Pool agents have been randomly freezing mid-job. The agent stops producing log output entirely, ADO receives no further heartbeat, and the job stays in "Running" state until the pipeline timeout kills it. This affects both our MDP pools independently.

ENVIRONMENT

JDNextBuild-Set (primary concern):

  • VM SKU: Standard_D4s_v3
  • OS image: Ubuntu 24.04 (custom Azure Image Builder image)
  • Agent version: 4.270.0 (recently also 4.271.0)
  • Region: East US
  • Network: VNet-injected, dedicated subnet
  • Max concurrency: 15

Terraform-Set:

  • VM SKU: Standard_D2s_v3
  • OS image: Ubuntu 24.04 (custom Azure Image Builder image)
  • Agent version: 4.270.0 (recently also 4.271.0)
  • Region: East US
  • Network: VNet-injected, dedicated subnet
  • Max concurrency: 10

Both pools are on separate subnets, separate VM SKUs, and completely different software stacks.

SYMPTOMS

  • Random freeze point: the agent can freeze at any step. On JDNextBuild-Set we have seen freezes during Docker builds, helm upgrade, kubectl operations, and PowerShell steps expected to complete in under 2 seconds. There is no consistent step or pattern.
  • No error output: the last log line is a normal, successful operation. No exception, exit code, or warning precedes the freeze.
  • Agent heartbeat stops: ADO receives no further communication. The job sits in "Running" until the timeout fires.
  • Agent diagnostic logs folder is empty: downloading the job logs and opening the "Agent Diagnostic Logs" folder shows no files whatsoever. This indicates the agent never reaches the point in its job lifecycle where diagnostic logs are collected and uploaded.
  • MDPResourceLog shows no errors: checked in Log Analytics; no errors or warnings appear for either pool during affected periods (the query we used is sketched after this list).
  • MDP provisioning metrics are clean: no CustomScriptError or provisioning failures in the Metrics blade.
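
For reference, the Log Analytics check mentioned above amounted to roughly the sketch below. The workspace GUID is a placeholder, and OperationType is an assumption for the operation column as the entries render in our workspace; Status is the field we see on each record.

#!/usr/bin/env bash
# Sketch: summarize a week of MDPResourceLog entries by operation and status.
# WORKSPACE_ID is a placeholder for the real Log Analytics workspace GUID.
WORKSPACE_ID="00000000-0000-0000-0000-000000000000"

az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query 'MDPResourceLog | where TimeGenerated > ago(7d) | summarize count() by OperationType, Status' \
  --output table

Every row comes back with Status: Completed; nothing in this table correlates with the freezes.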

EXAMPLE FROM RUN LOGS

From a Terraform-Set job:

2026-03-29T02:00:06Z Starting: Terraform Plan - bootstrap

...normal output...

[complete silence for 49 minutes 57 seconds]

2026-03-29T02:50:03Z Finishing: Finalize Job

2026-03-29T02:50:03Z The job running on agent Terraform-Set 4 stopped responding...

From a JDNextBuild-Set job:

2026-03-28T17:10:11Z Agent version: 4.270.0

...normal job output...

[freeze - no output, no error]

Agent Diagnostic Logs folder: empty (0 files)

WHAT WE HAVE RULED OUT

  • Agent version change: 4.270.0 has been in use for months with no issue prior to 2026-03-27. No version change coincides with onset.
  • Image-specific cause: both pools have completely different software (JDNextBuild: Docker, Node.js, .NET, helm, kubectl; Terraform: Terraform, PowerShell Az modules). Same symptom on both rules out anything image-related.
  • Specific task: freezes occur at random points including trivial 2-second tasks. No single step is consistently involved.
  • Burstable SKU throttling: JDNextBuild-Set is on non-burstable D4s_v3. Terraform-Set was moved from B2ms to D2s_v3; the issue predates and persists after this change.
  • Custom DNS / networking: Azure-provided DNS, standard VNet with NSG. No custom DNS or unusual routing.
  • No changes on our end: no image updates, pipeline changes, or infrastructure changes coincide with issue onset.

OBSERVATIONS

We have seen the MDP pool report a frozen agent as Ready even though the pipeline it was running, and is still supposed to be running, had not yet timed out. We have also seen MDP deprovision that agent while the pipeline still reported Running.

We have also seen MDP fail to provision additional agents even though both the self-hosted parallel job limit and the pool's maximum capacity would allow it. In that state, an MDP agent sits in the pool as Ready but is not allocated the pending job until the frozen one times out.
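
To watch the ADO side of this divergence, we poll the pool's agents and their assigned requests; a sketch is below, assuming the azure-devops CLI extension. The organization URL and pool id are placeholders, and the MDP-side state (Ready/Allocated) still has to be read from the portal.

#!/usr/bin/env bash
# Sketch: poll agent status and the job each agent is assigned, on the ADO side.
# ORG and POOL_ID are placeholders for our organization and agent pool.
ORG="https://dev.azure.com/our-org"
POOL_ID=42

while true; do
  date -u +%FT%TZ
  az pipelines agent list \
    --organization "$ORG" \
    --pool-id "$POOL_ID" \
    --include-assigned-request true \
    --query '[].{name:name, status:status, job:assignedRequest.definition.name}' \
    --output table
  sleep 30
done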

SIMILAR REPORT FOUND

A very similar problem was reported for MS-hosted agents on 2026-03-24:

https://learn.microsoft.com/en-us/answers/questions/5835366/ms-hosted-build-agent-acquisition-is-stuck-(outage

Key similarities:

  • Agent stuck with no meaningful error
  • Agent Diagnostic Logs folder is empty - identical to our MDP case
  • Azure status page showed all green throughout
  • Self-resolved after ~24 hours with no action taken
  • Accepted answer: "none of the advice here helped, it just randomly started working again after 24h"

Our issue differs in that it has persisted for multiple days across two independent pools rather than resolving on its own, and it is getting gradually worse by the day.

RELATED GITHUB ISSUES

WORKAROUNDS APPLIED (do not resolve root cause)

  • Reduced pipeline job timeouts (plan: 75 min, apply: 120 min) to free up concurrency slots faster after a freeze
  • Increased MDP pool concurrency to provide more headroom while frozen agents hold slots

QUESTIONS

  1. Has anyone else experienced this pattern with MDP agents specifically (not MS-hosted)?
  2. Is there any way to get better diagnostics from inside the MDP VM when a freeze occurs? We cannot access MDP VMs directly as they run in Microsoft's managed subscription.
  3. Is there a known platform issue with MDP in East US around this timeframe and ongoing?

Any insight appreciated


Answer accepted by question author

  1. Pravallika KV 14,235 Reputation points Microsoft External Staff Moderator
    2026-04-01T17:15:23.5733333+00:00

    Hi @Pall Bjornsson,

    Thanks for reaching out to Microsoft Q&A.

    Below is the update received from the backend team:

    There was an active platform-level incident reported, "Agents are freezing and not finishing builds in Azure DevOps using MDP pool", which lasted 5 days and 22 hours and affected multiple customer subscriptions:

    There was a service-side issue in Managed DevOps Pools where agents could stop mid-job with no further logs or heartbeat, so the pipeline would just sit in a running state until it timed out. That was identified internally as a regression and later fixed with a hotfix.

    Looking at what you described (jobs hanging mid-execution, no logs, and then everything going back to normal without any changes), this doesn’t point to anything in your pipeline or setup. It lines up with that kind of platform behavior.

    Based on the symptoms and how it resolved, I’d lean toward this being service-side rather than anything on your end.

    I know how disruptive something like this can be, especially when there’s no signal in the logs to work with. If everything has been stable since then, that’s a good sign it was tied to that fix and shouldn’t continue to impact you.

    Hope this helps!


    If the resolution was helpful, kindly click Yes for "Was this answer helpful". And if you have any further query, do let us know.

    1 person found this answer helpful.

2 additional answers

  1. Pall Bjornsson 20 Reputation points
    2026-04-03T22:47:06.03+00:00

    @Pravallika KV ,

    Today has been the first day, since the failures started 7 days ago, on which I have seen no freezing. I have been running a polling script at a 5-second interval on the jumpbox inside the agent VNet for 10 hours. For the last 3 of those 10 hours, I tried to keep all agents busy in both pools, continuously running the same pipelines that have been failing on us for the past 7 days. No failure today, and the polling indicates very good network stability: the occasional transient failure is isolated, does not repeat on the next poll, and is spread across the polled resources, so no pattern there.

    This is both good and bad news. Good, of course, to have no failures. Bad because the failures ran for 7 full days, and if they stop without any deliberate action on the backend, they are likely to resurface.

    For the time being, I will keep the jumpbox poller running in case the failures start again. If they do, we can at least correlate them with the poller output to see whether network issues were observed at the same time.
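
    For reference, the poller is essentially the sketch below: a timestamped HTTP probe at a 5-second interval, appended to a log for later correlation. The endpoint list is illustrative; we poll the resources our pipelines actually touch.

    #!/usr/bin/env bash
    # Jumpbox poller sketch: 5-second interval, timestamped log for correlation.
    # ENDPOINTS is an illustrative subset of what we actually poll.
    ENDPOINTS=(
      "https://dev.azure.com/our-org"
      "https://login.microsoftonline.com"
    )

    while true; do
      for url in "${ENDPOINTS[@]}"; do
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 4 "$url")
        echo "$(date -u +%FT%TZ) $url $code" >> "$HOME/mdp-poller.log"
      done
      sleep 5
    done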

    In case you are aware of any backend changes that may have contributed to a fix, please let me know.

    I will keep monitoring until middle of next week at least.


  2. Pall Bjornsson 20 Reputation points
    2026-04-03T10:06:45.1433333+00:00

    @Pravallika KV, thank you for the suggestions. Here is where we stand after investigating each point:

    Diagnostic settings (point 2): Already fully configured on both pools (allLogs enabled, sending to Log Analytics, managed in Terraform). However, after querying the workspace, only the MDPResourceLog table exists. There are no AgentProvisioning or AgentLifecycle tables. All entries in MDPResourceLog (Provision, Return, Reimage) show Status: Completed with no errors. The freeze produces zero trace in any logging surface available to us.
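
    The table check itself was a one-liner; sketch below, with a placeholder workspace GUID.

    az monitor log-analytics query \
      --workspace "00000000-0000-0000-0000-000000000000" \
      --analytics-query 'search * | distinct $table' \
      --output table

    Only MDPResourceLog comes back.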

    Interactive mode (point 3): We plan to enable this on our Terraform pool specifically. We have made a useful observation about the agent state during a freeze: the agent appears to remain in Allocated state for a significant period before transitioning to Ready, and does not get reassigned a new job until after the pipeline timeout fires. This suggests the VM stays alive during the freeze and we likely have a viable diagnostic window to SSH in. We are preparing to enable interactive mode with an extended grace period to catch the next freeze.
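
    Once SSH access is available, our first-pass checklist on a frozen VM is roughly the sketch below. The paths are assumptions; the agent root and its _diag folder depend on how our image lays the agent down.

    # To run over SSH on a frozen VM once interactive mode is enabled.
    ps aux | grep -i '[a]gent.listener'     # is the agent process still alive?
    ls -la ~/agents/*/_diag/                # trace files that were never uploaded?
    sudo dmesg --ctime | tail -n 50         # kernel events: OOM kill, hung tasks
    sudo journalctl --since '-2 hours' --no-pager | tail -n 200
    free -m; df -h                          # memory and disk pressure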

    Debug logging (point 4): We will enable system.debug: true and AGENT_TRACE=1. We don't expect this to capture the freeze event itself since the agent produces zero output after the freeze point, but it may reveal something in the lead-up.
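
    To queue a verbose run from the jumpbox, something like the sketch below should work, assuming the pipeline permits setting these variables at queue time (organization, project, and pipeline name are placeholders):

    az pipelines run \
      --organization "https://dev.azure.com/our-org" \
      --project "our-project" \
      --name "terraform-plan-bootstrap" \
      --variables system.debug=true AGENT_TRACE=1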

    Network validation (point 1): Our setup uses Azure-provided DNS and a standard NSG with no custom routing. Agents succeed most of the time and the freeze is entirely random, including on steps with no external network calls. A network issue would typically produce a connection error rather than a silent hang, so we consider this low probability. We will however run the connectivity validation script from within the subnet as a baseline.
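
    The baseline we have in mind is the sketch below; the host list is an illustrative subset of what the agents must reach (the full list is in the MDP networking documentation).

    # One-shot DNS + HTTPS reachability baseline from the agent subnet.
    for host in dev.azure.com login.microsoftonline.com; do
      ip=$(dig +short "$host" | head -n1)
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "https://$host")
      echo "$host -> ${ip:-NO-DNS} (HTTP $code)"
    done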

    Additional data point on agent state: We have observed that MDP sometimes shows a frozen agent as Ready in the pool before the ADO pipeline has timed out, and then deprovisions that agent while ADO still reports the job as Running. This state divergence between MDP and ADO during the freeze may be a useful signal for the engineering team.

    We will report back with SSH diagnostics from the next freeze event. We also have an open support request via our CSP. The issue has been getting progressively worse since 2026-03-27; yesterday required 1-3 retries per pipeline throughout the day. Any escalation to the MDP engineering team, who can access internal pool and VM lifecycle telemetry, would be greatly appreciated.

