Replication Troubleshooting

Every replication failure surfaces a stable error code. The code drives the UI message, alert grouping, and the docs link. Use it to find the right fix below.


Quick Index

Code                    Recoverable by      Section
source_auth_failed      Operator            Auth
target_auth_failed      Operator            Auth
rbac_backup_key         Operator            RBAC
rbac_restore_key        Operator            RBAC
source_not_found        Auto                Inventory
target_soft_deleted     Operator            Conflict
target_conflict_active  User (acknowledge)  Conflict
cross_geography_key     Scope (info only)   Geography
unknown_key_type        Auto → Operator     Inventory
throttled               Auto (retry)        Transient
transient_network       Auto (retry)        Transient
unknown                 Manual              Catch-all
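
For alert routing or scripted triage, the index above can be expressed as a simple lookup. This is a minimal sketch: the function name recoverable_by and the strings it prints are illustrative and not part of CertifyClouds; only the error codes come from the table.

```shell
# Hypothetical helper (not a CertifyClouds API): maps an error code from the
# Quick Index to who is expected to resolve it.
recoverable_by() {
  case "$1" in
    source_auth_failed|target_auth_failed|rbac_backup_key|rbac_restore_key|target_soft_deleted)
      echo "operator" ;;
    target_conflict_active)
      echo "user-acknowledge" ;;
    cross_geography_key)
      echo "scope-info-only" ;;
    source_not_found)
      echo "auto" ;;
    unknown_key_type)
      echo "auto-then-operator" ;;
    throttled|transient_network)
      echo "auto-retry" ;;
    *)
      # "unknown" and any code this sketch does not list
      echo "manual" ;;
  esac
}
```

Treating every unlisted code as manual keeps the sketch safe if new codes are introduced later.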

Error Codes

source_auth_failed

What it means: CertifyClouds couldn't authenticate against the source vault (HTTP 401, or 403 on a secret/certificate read path).

Why it happens

  • The identity CertifyClouds runs as has no read role assigned on the source vault
  • The vault uses the legacy Access Policies model and the principal is not listed
  • A service principal secret expired
  • Managed identity is disabled on the host

How to fix it

  1. Confirm which identity your deployment uses (Settings > Azure Identity or $AZURE_CLIENT_ID)
  2. Check role assignments:
    az role assignment list \
      --assignee $PRINCIPAL_ID \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<source-vault>
    
  3. Grant Key Vault Secrets User + Key Vault Certificate User + Key Vault Crypto User on the source vault (or assign the custom source role, see RBAC)
  4. Wait up to 5 minutes for RBAC propagation, then re-run Validate connection on the config
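
Step 3 can be scripted. The sketch below only prints the three az role assignment create commands so they can be reviewed before anything runs. print_source_grants is a hypothetical helper name, and the principal ID and vault scope are values you supply.

```shell
# Prints (does not execute) one role assignment command per built-in source role.
# $1 = principal (object) ID, $2 = full resource ID of the source vault.
print_source_grants() {
  principal_id=$1
  scope=$2
  for role in "Key Vault Secrets User" "Key Vault Certificate User" "Key Vault Crypto User"; do
    printf 'az role assignment create --assignee "%s" --role "%s" --scope "%s"\n' \
      "$principal_id" "$role" "$scope"
  done
}
```

After checking the output, pipe it to sh to apply the grants, then continue with step 4.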

When to escalate: If Validate connection still fails with source_auth_failed after you have confirmed the assignments are present and propagated, open a support ticket with the correlation ID from the response header.


target_auth_failed

What it means: Authentication failure on the target vault (HTTP 401, or 403 on a write path that isn't specifically a key backup/restore).

Why it happens

  • No write role on the target
  • Target vault uses Access Policies and the principal isn't listed
  • Vault firewall blocks the egress IP of the CertifyClouds deployment

How to fix it

  1. Verify write permissions on the target:
    az role assignment list \
      --assignee $PRINCIPAL_ID \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<target-vault>
    
  2. Grant Key Vault Secrets Officer + Key Vault Certificates Officer + Key Vault Crypto Officer on the target (or assign the custom target role)
  3. If the vault uses a firewall, allow the CertifyClouds egress subnet or public IP
  4. Retry the sync from the records table (Replicate now)
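
The firewall check in step 3 can be sketched with the Azure CLI. This assumes the restriction is the vault's network ACL (defaultAction set to Deny); private endpoints and disabled public network access are not covered. allow_egress_ip is a hypothetical helper, and the vault name and egress IP are placeholders for your own values.

```shell
# $1 = target vault name, $2 = public egress IP of the CertifyClouds deployment.
allow_egress_ip() {
  vault=$1
  ip=$2
  # defaultAction is "Deny" when the vault firewall restricts public access.
  default_action=$(az keyvault show --name "$vault" \
    --query "properties.networkAcls.defaultAction" --output tsv)
  if [ "$default_action" = "Deny" ]; then
    az keyvault network-rule add --name "$vault" --ip-address "$ip"
  else
    echo "Firewall is not restricting $vault (defaultAction=${default_action:-unset}); no rule added."
  fi
}
```

If the deployment egresses through a subnet rather than a public IP, the same command accepts subnet arguments instead of --ip-address.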

When to escalate: If target_auth_failed persists after both RBAC and firewall checks come back clean.


rbac_backup_key

What it means: HTTP 403 specifically on the keys/backup data action against the source vault. The engine can read secrets but can't back up keys.

Why it happens: The source role is missing Microsoft.KeyVault/vaults/keys/backup/action. Common when using a hand-rolled role that only covers secrets and certificates.

How to fix it

Grant one of:

  • Key Vault Crypto User role on the source vault, or
  • The custom source role (includes keys/backup/action)

    az role assignment create \
      --assignee $PRINCIPAL_ID \
      --role "Key Vault Crypto User" \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<source-vault>

Then run Replicate now on the blocked key records. If key replication isn't required for this pair, uncheck keys under What to replicate in the config; the blocked records flip to excluded on the next expansion.
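
To confirm whether a hand-rolled role actually carries the backup action, its data actions can be listed and matched. The sketch below is an approximation: it ignores notDataActions and matches case-sensitively, so treat a negative result as a prompt to inspect the role rather than as proof. Both function names are hypothetical.

```shell
# Succeeds if any data-action pattern read from stdin (one per line) covers $1.
# A "*" wildcard in a role definition is treated as a shell glob here.
covers_action() {
  wanted=$1
  while IFS= read -r pattern; do
    [ -n "$pattern" ] || continue
    case "$wanted" in
      # Intentionally unquoted so "*" in the pattern acts as a wildcard.
      $pattern) return 0 ;;
    esac
  done
  return 1
}

# $1 = role name (built-in or custom), $2 = data action to look for.
role_grants_action() {
  az role definition list --name "$1" \
    --query "[].permissions[].dataActions[]" --output tsv |
    covers_action "$2"
}
```

For example, calling role_grants_action with your custom role name and Microsoft.KeyVault/vaults/keys/backup/action returns success only if the role lists that action or a wildcard covering it. The same check works for keys/restore/action on a target role.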

When to escalate: Not typically needed. This is a pure RBAC fix.


rbac_restore_key

What it means: HTTP 403 on the keys/restore data action against the target vault. The backup was fetched successfully but can't be restored.

Why it happens: Target role missing Microsoft.KeyVault/vaults/keys/restore/action.

How to fix it

Grant one of:

  • Key Vault Crypto Officer on the target vault, or
  • The custom target role

    az role assignment create \
      --assignee $PRINCIPAL_ID \
      --role "Key Vault Crypto Officer" \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<target-vault>

When to escalate: Not typically needed.


source_not_found

What it means: The source item no longer exists (HTTP 404 on the source read). The record was valid when expansion ran, but the item has since been deleted, renamed, or had its matching tag removed.

Why it happens

  • User deleted or renamed the item in the source vault
  • Matching tag was removed
  • Rule was narrowed and this item no longer qualifies

How to fix it: Nothing to do manually. The orphan flow handles this automatically: if the item is missing for two consecutive expansion cycles, the record is disabled (status flips to no_longer_matched). If the item reappears (for example, the tag is re-added or the item is undeleted), the next expansion re-enables the record.

If you want immediate cleanup, trigger Re-expand rules on the config.
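
The two-cycle rule can be pictured as a small state step. This is an illustration of the behaviour described above, not the engine's code: orphan_step and the status names "active" and "pending" are hypothetical, while no_longer_matched is the status named in this section.

```shell
# Prints "<consecutive_misses> <status>" after one expansion cycle.
# $1 = consecutive misses before this cycle, $2 = "yes" if the item was found.
orphan_step() {
  misses=$1
  found=$2
  if [ "$found" = "yes" ]; then
    # Item is back: reset the counter and re-enable the record.
    echo "0 active"
  elif [ $((misses + 1)) -ge 2 ]; then
    # Missing for two consecutive cycles: disable the record.
    echo "$((misses + 1)) no_longer_matched"
  else
    # First miss: keep the record and wait for the next cycle.
    echo "$((misses + 1)) pending"
  fi
}
```

A record that misses once and is then found again never reaches the disabled state, because the counter resets on any cycle that finds the item.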

When to escalate: If records stay in source_not_found for more than two full expansion cycles without being disabled, file a bug with the correlation ID.


target_soft_deleted

What it means: The target vault has a soft-deleted item with the same name. Azure blocks writes to a name currently in the soft-deleted state until the item is either purged or recovered.

Why it happens: Someone deleted the item on the target vault in a previous sync cycle or out-of-band, and the vault's soft-delete retention period hasn't expired.

How to fix it

The record shows two actions:

  1. Open Azure Portal ↗: deep-link to the target vault's deleted items blade. Options there:
    • Recover: bring the existing item back (replication will still try to overwrite on next sync if it's source of truth)
    • Purge: permanently delete the soft-deleted item, clearing the name for CertifyClouds to write
  2. Retry: re-attempts sync. This fails until the soft-deleted item is resolved.

In-app purge is not exposed in v1

CertifyClouds deliberately does not provide in-app purge of target items: blast radius is too high, and purge is irreversible. Use the Azure Portal.
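
The portal is the path this page documents. If you work from a terminal, the same Azure-side recover or purge can be sketched with the az keyvault commands; this acts on Azure directly and does not go through CertifyClouds. resolve_soft_deleted is a hypothetical wrapper. Purge needs purge permission on the vault and is refused while purge protection is enabled.

```shell
# $1 = vault name, $2 = kind (secret | key | certificate),
# $3 = item name, $4 = action (recover | purge).
resolve_soft_deleted() {
  vault=$1
  kind=$2
  name=$3
  action=$4
  case "$action" in
    recover|purge) ;;
    *) echo "action must be recover or purge" >&2; return 2 ;;
  esac
  # show-deleted exits non-zero when no soft-deleted item has this name.
  if ! az keyvault "$kind" show-deleted --vault-name "$vault" --name "$name" >/dev/null 2>&1; then
    echo "$name is not soft-deleted in $vault; nothing to do."
    return 0
  fi
  az keyvault "$kind" "$action" --vault-name "$vault" --name "$name"
}
```

Once the item is recovered or purged, use Retry on the record.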

When to escalate: Not applicable; resolution is in your hands.


target_conflict_active

What it means: First-write conflict. The target vault already has an active (non-soft-deleted) item with the same name. CertifyClouds blocks the write until you explicitly acknowledge the overwrite.

Why it happens

  • Target vault was pre-populated out-of-band
  • Two replication configs target the same vault with overlapping rules
  • A human wrote to the target vault directly

How to fix it

  1. Open the record in the table
  2. Click Acknowledge conflict. The modal shows source and target metadata (name, updated times) with an explicit overwrite warning
  3. Confirm to set acknowledged_conflict=true. The next sync will overwrite the target.

Or, if this is a systemic choice for the whole config, set overwrite_on_conflict=true in the config's behaviour settings; first-write conflicts will then overwrite immediately without the acknowledge step.
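
The interaction between the two flags can be summarised as a small decision function. This is a sketch of the behaviour described above, not the engine's implementation; first_write_decision is a hypothetical name, while acknowledged_conflict and overwrite_on_conflict are the settings named in this section.

```shell
# Prints "write" or "block" for the first write of a record.
# $1 = "true" if an active item with the same name exists on the target,
# $2 = acknowledged_conflict (per record), $3 = overwrite_on_conflict (per config).
first_write_decision() {
  target_exists=$1
  acknowledged_conflict=$2
  overwrite_on_conflict=$3
  if [ "$target_exists" != "true" ]; then
    echo "write"    # no conflict at all
  elif [ "$acknowledged_conflict" = "true" ] || [ "$overwrite_on_conflict" = "true" ]; then
    echo "write"    # conflict, but the overwrite was approved
  else
    echo "block"    # surfaces as target_conflict_active
  fi
}
```

Either flag is enough on its own: the per-record acknowledgement approves a single overwrite, while the config-level setting approves all of them.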

When to escalate: Only if the acknowledge flow itself fails with a different error code.


cross_geography_key

What it means: Key replication is blocked because the source and target vaults are in different Azure geographies. Azure's backup_key / restore_key operations are same-geography only, for software keys and HSM keys alike.

This code is also used when CertifyClouds' geography map returns "unknown" for either vault. This is a deliberately conservative block rather than a silent attempt.

Why it happens

  • Source in (e.g.) europe, target in north_america
  • Vault in a brand-new Azure region the geography map hasn't been taught yet

How to fix it

  • Secrets and certificates are unaffected: they replicate across geographies. If keys aren't required for this pair, uncheck keys in the config's What to replicate.
  • For same-geography key replication: pick a target vault in the source's geography.
  • For cross-geography keys: not supported. There is no BYOK workaround built into the product.
  • For unknown geography: wait for a CertifyClouds release that adds the region to the map, or open a support ticket with the vault region so we can prioritise the update.
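
The gate described in this section can be sketched as follows. The four-region map is purely illustrative and is not the CertifyClouds geography map; geography_of and key_replication_gate are hypothetical names. The point is the shape of the rule: a mismatch or an unknown geography on either side blocks key replication.

```shell
# Illustrative map only; the real geography map covers many more regions.
geography_of() {
  case "$1" in
    northeurope|westeurope) echo "europe" ;;
    eastus|westus)          echo "north_america" ;;
    *)                      echo "unknown" ;;
  esac
}

# Prints "ok" when keys may replicate, otherwise the blocking error code.
# $1 = source vault region, $2 = target vault region.
key_replication_gate() {
  src_geo=$(geography_of "$1")
  dst_geo=$(geography_of "$2")
  if [ "$src_geo" = "unknown" ] || [ "$dst_geo" = "unknown" ] || [ "$src_geo" != "$dst_geo" ]; then
    echo "cross_geography_key"
  else
    echo "ok"
  fi
}
```

A vault's region can be read with az keyvault show --name <vault> --query location --output tsv and passed in as either argument.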

When to escalate: Only to request a geography map update for a new region.


unknown_key_type

What it means: The scanner doesn't yet have a key_type value (RSA, RSA-HSM, EC, etc.) for this key, so the expander can't evaluate the cross-geography gate.

Why it happens: The scanner hasn't run a full cycle since the key was created.

How to fix it

  1. Wait for the next expansion cycle. The expander retries automatically and populates key_type lazily.
  2. If it persists, trigger a fresh discovery scan on the source subscription, then Re-expand rules on the config

When to escalate: If unknown_key_type persists after two scan-plus-expand cycles, open a ticket with the correlation ID.


throttled

What it means: Azure returned HTTP 429 (rate limited). The engine automatically retries with exponential backoff, respecting the Retry-After header.
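
As an illustration of exponential backoff that honours Retry-After, the delay calculation might look like the sketch below. The base of 2 seconds and the cap of 300 seconds are assumed values, not the engine's real settings, and Retry-After is assumed to arrive as a whole number of seconds (the header can also carry an HTTP date, which this sketch does not parse).

```shell
# Prints the seconds to wait before retry number $1 (0-based).
# $2 = value of the Retry-After header in seconds, or empty if absent.
backoff_delay() {
  attempt=$1
  retry_after=$2
  base=2     # assumed starting delay
  cap=300    # assumed upper bound
  if [ -n "$retry_after" ]; then
    # The server's hint wins over the computed delay.
    echo "$retry_after"
    return 0
  fi
  delay=$base
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  echo "$delay"
}
```

Production implementations usually add random jitter to the computed delay so that many records throttled at once do not retry in lockstep; that is omitted here for clarity.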

Why it happens: High activity against the same vault, such as bulk replicates or simultaneous discovery scans, or other tenants sharing Azure's per-vault limits.

How to fix it: Usually nothing. The engine handles it.

If a record has exhausted retries and surfaced throttled to you:

  1. Reduce parallel activity (pause other bulk operations on the vault)
  2. Increase the sync interval for this config (5m → 1h or 24h)
  3. Manually Retry the record

When to escalate: If throttled appears without a clear traffic spike, or persists across multiple cycles, there may be a quota issue on the vault.


transient_network

What it means: HTTP 5xx, timeout, or connection error. The engine automatically retries with exponential backoff.

Why it happens: Azure service blip, network hiccup, DNS lag.

How to fix it: Usually nothing. If records persist in transient_network after retries are exhausted, click Retry on the record or wait for the next automatic cycle.

When to escalate: If many records fail with transient_network at once across multiple configs, check the Azure status page for your region. If Azure is green, gather correlation IDs and open a ticket.


unknown

What it means: An exception the engine didn't recognise. Full exc_info is logged backend-side.

Why it happens: A case the error-mapping layer doesn't cover yet. Treat as a bug in CertifyClouds, not in your configuration.

How to fix it

  1. Grab the correlation ID from the record's history entry or the X-Correlation-Id response header
  2. Open a support ticket with:
    • Correlation ID
    • Config ID
    • Record ID
    • Rough timestamp of the failure
  3. While waiting for a fix, if the record is critical, try Retry: intermittent unknowns sometimes clear on the next attempt.

When to escalate: Every unknown should be reported. This code exists specifically to surface gaps in our error mapping.


General Tips

Correlation IDs

Every sync run generates a correlation ID. Quote it in any support ticket: it lets us pull the exact run from logs, audit events, and WebSocket streams in one query.

Find it in:

  • The Record history entries shown in the UI for each replication record
  • The X-Correlation-Id response header on any replication API response
  • Backend log lines for that run
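
If you call the replication API from a script, the header can be pulled out of the raw response headers. The helper below is a small sketch; correlation_id is a hypothetical name, and it expects header lines on stdin (for example from curl -sS -D - -o /dev/null against whichever replication endpoint you are calling).

```shell
# Reads HTTP response headers on stdin and prints the X-Correlation-Id value.
correlation_id() {
  # Header names are case-insensitive; strip the CR that ends each HTTP line.
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-correlation-id" { print $2; exit }'
}
```

The function prints nothing when the header is absent, so an empty result means the response did not come from a replication call.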

Alert Flooding

Repeated failures on the same record are de-duplicated in the audit log to prevent a single stuck record from flooding alerts. Retry counts and error counts continue to be tracked.

If you're not seeing alerts you expect, confirm alerts.notify_on_failure is enabled on the config's alerts settings.

Re-expansion Timeout

If a config saves but expansion times out, the response is marked truncated: true with expansion_state: 'partial'. The config is still persisted, and the next worker cycle resumes expansion from the last checkpointed batch (last_batch_succeeded_at). No data is lost, and a banner in the UI tells you expansion is still running.
