Replication Troubleshooting¶

Every replication failure surfaces a stable error code. The code drives the UI message, alert grouping, and the docs link. Use it to find the right fix below.

Quick Index¶

Code	Recoverable by	Section
`source_auth_failed`	Operator	Auth
`target_auth_failed`	Operator	Auth
`rbac_backup_key`	Operator	RBAC
`rbac_restore_key`	Operator	RBAC
`source_not_found`	Auto	Inventory
`target_soft_deleted`	Operator	Conflict
`target_conflict_active`	User (acknowledge)	Conflict
`cross_geography_key`	Scope (info only)	Geography
`unknown_key_type`	Auto → Operator	Inventory
`throttled`	Auto (retry)	Transient
`transient_network`	Auto (retry)	Transient
`unknown`	Manual	Catch-all

Error Codes¶

`source_auth_failed`¶

What it means: CertifyClouds couldn't authenticate against the source vault (HTTP 401, or 403 on a secret/certificate read path).

Why it happens

The identity CertifyClouds runs as has no read role assigned on the source vault
The vault uses the legacy Access Policies model and the principal is not listed
A service principal secret expired
Managed identity is disabled on the host

How to fix it

Confirm which identity your deployment uses (Settings > Azure Identity or $AZURE_CLIENT_ID)

Check role assignments:

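# List the roles the principal holds on the vault, including any inherited
# from the resource group or subscription.
az role assignment list \
  --assignee "$PRINCIPAL_ID" \
  --include-inherited \
  --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<source-vault>"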

Grant Key Vault Secrets User + Key Vault Certificate User + Key Vault Crypto User on the source vault (or assign the custom source role, see RBAC)
Wait up to 5 minutes for RBAC propagation, then re-run Validate connection on the config
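The role check above can be scripted. The following is a minimal sketch, not part of CertifyClouds: `missing_source_roles` is a hypothetical helper that reads role names on standard input and prints whichever of the three built-in roles from the previous step are absent. It does not know about the custom source role, so a deployment that uses that role will see all three reported as missing.

```shell
# Built-in roles the source vault needs, one per line (taken from the step above).
REQUIRED_SOURCE_ROLES='Key Vault Secrets User
Key Vault Certificate User
Key Vault Crypto User'

# Reads assigned role names on stdin (one per line) and prints each required
# role that is not among them. Prints nothing when all three are present.
missing_source_roles() {
  assigned=$(cat)
  printf '%s\n' "$REQUIRED_SOURCE_ROLES" | while IFS= read -r role; do
    printf '%s\n' "$assigned" | grep -Fxq -- "$role" || printf '%s\n' "$role"
  done
}

# Example wiring (not executed here; needs the az CLI and a signed-in session):
#   az role assignment list --assignee "$PRINCIPAL_ID" --include-inherited \
#     --scope "$SOURCE_VAULT_SCOPE" --query "[].roleDefinitionName" --output tsv \
#     | missing_source_roles
```

`SOURCE_VAULT_SCOPE` is a placeholder for the full resource ID shown in the command above.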

When to escalate: After confirming the assignments are present and propagated, if validate still fails with source_auth_failed, open a support ticket with the correlation ID from the response header.

`target_auth_failed`¶

What it means: Authentication failure on the target vault (HTTP 401, or 403 on a write path that isn't specifically a key backup/restore).

Why it happens

No write role on the target
Target vault uses Access Policies and the principal isn't listed
Vault firewall blocks the egress IP of the CertifyClouds deployment

How to fix it

Verify write permissions on the target:

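# List the roles the principal holds on the target vault, including any
# inherited from the resource group or subscription.
az role assignment list \
  --assignee "$PRINCIPAL_ID" \
  --include-inherited \
  --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<target-vault>"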

Grant Key Vault Secrets Officer + Key Vault Certificates Officer + Key Vault Crypto Officer on the target (or assign the custom target role)
If the vault uses a firewall, allow the CertifyClouds egress subnet or public IP
Retry the sync from the records table (Replicate now)
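Two of the causes listed above (Access Policies model, vault firewall) can be read from the vault's properties. The sketch below assumes you have already fetched two values with `az keyvault show`; `diagnose_vault_access` is a hypothetical helper that only interprets them. It does not cover private endpoints or a disabled public network access setting.

```shell
# Summarises two values exposed by `az keyvault show`:
#   $1 = properties.enableRbacAuthorization   (true, or anything else)
#   $2 = properties.networkAcls.defaultAction (Deny, or anything else)
diagnose_vault_access() {
  case "$1" in
    true) model="rbac" ;;
    *)    model="access-policies" ;;
  esac
  case "$2" in
    Deny) firewall="restricted" ;;
    *)    firewall="open" ;;
  esac
  printf 'model=%s firewall=%s\n' "$model" "$firewall"
}

# Example wiring (not executed here; needs the az CLI):
#   rbac=$(az keyvault show --name "<target-vault>" \
#     --query "properties.enableRbacAuthorization" --output tsv)
#   fw=$(az keyvault show --name "<target-vault>" \
#     --query "properties.networkAcls.defaultAction" --output tsv)
#   diagnose_vault_access "$rbac" "$fw"
```

A result of `model=access-policies` means role assignments alone will not help, and `firewall=restricted` means the egress IP or subnet has to be allowed explicitly.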

When to escalate: If target_auth_failed persists after both RBAC and firewall checks come back clean.

`rbac_backup_key`¶

What it means: HTTP 403 specifically on the keys/backup data action against the source vault. The engine can read secrets but can't back up keys.

Why it happens: The source role is missing Microsoft.KeyVault/vaults/keys/backup/action. Common when using a hand-rolled role that only covers secrets and certificates.

How to fix it

Grant one of:

Key Vault Crypto User role on the source vault, or
The custom source role (includes keys/backup/action)

# Grant the built-in role that includes keys/backup/action on the source vault.
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Key Vault Crypto User" \
  --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<source-vault>"

Then Replicate now on the blocked key records. If key replication isn't required for this pair, uncheck keys in the config's What to replicate: the blocked records will flip to excluded on the next expansion.
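If a hand-rolled role is in use, it can help to confirm whether it already grants the data action before assigning another role. A minimal sketch, assuming the role's `dataActions` are supplied one per line: `role_grants` is a hypothetical helper, it ignores `notDataActions`, and it matches case-sensitively, which may be stricter than Azure's own evaluation.

```shell
# Reads granted dataActions on stdin (one per line, wildcards such as
# "Microsoft.KeyVault/vaults/keys/*" allowed). Exits 0 if any entry covers
# the action named in $1, and 1 otherwise.
role_grants() {
  needed=$1
  while IFS= read -r granted || [ -n "$granted" ]; do
    [ -n "$granted" ] || continue
    # $granted is deliberately unquoted so that "*" acts as a wildcard.
    case "$needed" in
      $granted) return 0 ;;
    esac
  done
  return 1
}

# Example wiring (not executed here; needs the az CLI):
#   az role definition list --name "<role-name>" \
#     --query "[].permissions[].dataActions[]" --output tsv \
#     | role_grants "Microsoft.KeyVault/vaults/keys/backup/action" \
#     && echo "backup action granted"
```

The same helper works for the target side by passing `Microsoft.KeyVault/vaults/keys/restore/action` instead.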

When to escalate: Not typically needed. This is a pure RBAC fix.

`rbac_restore_key`¶

What it means: HTTP 403 on the keys/restore data action against the target vault. The backup was fetched successfully but can't be restored.

Why it happens: Target role missing Microsoft.KeyVault/vaults/keys/restore/action.

How to fix it

Grant one of:

Key Vault Crypto Officer on the target vault, or
The custom target role

# Grant the built-in role that includes keys/restore/action on the target vault.
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Key Vault Crypto Officer" \
  --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<target-vault>"

When to escalate: Not typically needed.

`source_not_found`¶

What it means: The source item no longer exists (HTTP 404 on the source read). The record was valid when expansion ran, but the item has since been deleted, renamed, or had its matching tag removed.

Why it happens

User deleted or renamed the item in the source vault
Matching tag was removed
Rule was narrowed and this item no longer qualifies

How to fix it: Nothing to do manually. The orphan flow handles this automatically: if the item is missing for two consecutive expansion cycles, the record is disabled (status flips to no_longer_matched). If the item reappears (for example, the tag is re-added or the item is undeleted), the next expansion re-enables the record.

If you want immediate cleanup, trigger Re-expand rules on the config.
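The orphan flow can be pictured as a small per-record state step that runs once per expansion cycle. This is an illustrative sketch of the rule described above, not the engine's code: `orphan_step` and the `active` status label are hypothetical, while `source_not_found` and `no_longer_matched` come from this page.

```shell
# One expansion cycle for one record.
#   $1 = consecutive cycles the item has been missing so far
#   $2 = "present" or "missing" for the current cycle
# Prints "<new_miss_count> <status>".
orphan_step() {
  misses=$1
  if [ "$2" = "present" ]; then
    # Item exists (again): reset the counter and re-enable the record.
    echo "0 active"
    return
  fi
  misses=$((misses + 1))
  if [ "$misses" -ge 2 ]; then
    # Missing for two consecutive cycles: the record is disabled.
    echo "$misses no_longer_matched"
  else
    echo "$misses source_not_found"
  fi
}

orphan_step 0 missing   # first miss
orphan_step 1 missing   # second consecutive miss
orphan_step 2 present   # item reappears
```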

When to escalate: If records stay in source_not_found for more than two full expansion cycles without being disabled, file a bug with the correlation ID.

`target_soft_deleted`¶

What it means: The target vault has a soft-deleted item with the same name. Azure blocks writes to a name currently in the soft-deleted state until the item is either purged or recovered.

Why it happens: Someone deleted the item on the target vault in a previous sync cycle or out-of-band, and the vault's soft-delete retention period hasn't expired.

How to fix it

The record shows two actions:

Open Azure Portal ↗: deep-link to the target vault's deleted items blade. Options there:
- Recover: bring the existing item back (replication will still try to overwrite on next sync if it's source of truth)
- Purge: permanently delete the soft-deleted item, clearing the name for CertifyClouds to write
Retry: re-attempts sync. This fails until the soft-deleted item is resolved.

In-app purge is not exposed in v1

CertifyClouds deliberately does not provide in-app purge of target items: blast radius is too high, and purge is irreversible. Use the Azure Portal.
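If you prefer the Azure CLI to the portal, the same list, recover, and purge operations exist under `az keyvault secret`, `az keyvault key`, and `az keyvault certificate`. The sketch below only prints the command so that nothing irreversible runs by accident; `soft_delete_cmd` is a hypothetical helper, and the vault and item names in the example are placeholders.

```shell
# Prints (does not run) the az command for a soft-deleted item.
#   $1 = action:    list-deleted | recover | purge
#   $2 = item type: secret | key | certificate
#   $3 = vault name
#   $4 = item name (not used for list-deleted)
soft_delete_cmd() {
  case "$1" in
    list-deleted)
      printf 'az keyvault %s list-deleted --vault-name %s\n' "$2" "$3" ;;
    recover|purge)
      printf 'az keyvault %s %s --vault-name %s --name %s\n' "$2" "$1" "$3" "$4" ;;
    *)
      echo "unsupported action: $1" >&2
      return 1 ;;
  esac
}

soft_delete_cmd list-deleted secret my-target-vault
soft_delete_cmd recover secret my-target-vault db-password
```

Review the printed command before running it. Purge remains irreversible, and it fails when purge protection is enabled on the vault.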

When to escalate: Not applicable; resolution is in your hands.

`target_conflict_active`¶

What it means: First-write conflict. The target vault already has an active (non-soft-deleted) item with the same name. CertifyClouds blocks the write until you explicitly acknowledge the overwrite.

Why it happens

Target vault was pre-populated out-of-band
Two replication configs target the same vault with overlapping rules
A human wrote to the target vault directly

How to fix it

Open the record in the table
Click Acknowledge conflict. The modal shows source and target metadata (name, updated times) with an explicit overwrite warning
Confirm to set acknowledged_conflict=true. The next sync will overwrite the target.

Or, if this is a systemic choice for the whole config, set overwrite_on_conflict=true in the config's behaviour settings; first-write conflicts will then overwrite immediately without the acknowledge step.
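How the two flags interact on a first write can be summarised as a small decision function. This is a sketch of the behaviour described above under the assumption that either flag alone is enough to allow the overwrite; `first_write_decision` and its output labels are hypothetical.

```shell
# Decides what happens on the first write of a record.
#   $1 = target already has an active item with this name: yes | no
#   $2 = acknowledged_conflict on the record:  true | false
#   $3 = overwrite_on_conflict on the config:  true | false
first_write_decision() {
  if [ "$1" != "yes" ]; then
    echo "write"                          # nothing to conflict with
    return
  fi
  if [ "$3" = "true" ] || [ "$2" = "true" ]; then
    echo "overwrite"                      # config-wide opt-in or per-record acknowledgement
    return
  fi
  echo "block:target_conflict_active"     # wait for explicit acknowledgement
}

first_write_decision yes false false
first_write_decision yes true false
```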

When to escalate: Only if the acknowledge flow itself fails with a different error code.

`cross_geography_key`¶

What it means: Key replication is blocked because the source and target vaults are in different Azure geographies. Azure's backup_key / restore_key operations are same-geography only, for software keys and HSM keys alike.

This code is also used when CertifyClouds' geography map returns "unknown" for either vault. In that case the engine blocks conservatively rather than silently attempting the operation.

Why it happens

Source in (e.g.) europe, target in north_america
Vault in a brand-new Azure region the geography map hasn't been taught yet

How to fix it

Secrets and certificates are unaffected: they replicate across geographies. If keys aren't required for this pair, uncheck keys in the config's What to replicate.
For same-geography key replication: pick a target vault in the source's geography.
For cross-geography keys: not supported. There is no BYOK workaround built into the product.
For unknown geography: wait for a CertifyClouds release that adds the region to the map, or open a support ticket with the vault region so we can prioritise the update.
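The gate can be sketched as a lookup followed by a comparison. The region-to-geography table below is a tiny illustrative sample, not CertifyClouds' actual map, and both function names are hypothetical. The geography labels `europe` and `north_america` are the ones used on this page.

```shell
# Maps an Azure region name to a geography label.
# Illustrative sample only; unlisted regions fall through to "unknown".
geography_of() {
  case "$1" in
    westeurope|northeurope) echo "europe" ;;
    eastus|westus2)         echo "north_america" ;;
    *)                      echo "unknown" ;;
  esac
}

# Prints "ok" when keys may replicate between the two regions, otherwise the
# error code. An unknown geography on either side blocks conservatively.
key_replication_gate() {
  src=$(geography_of "$1")
  tgt=$(geography_of "$2")
  if [ "$src" = "unknown" ] || [ "$tgt" = "unknown" ] || [ "$src" != "$tgt" ]; then
    echo "cross_geography_key"
  else
    echo "ok"
  fi
}

# A vault's region can be read with (not executed here; needs the az CLI):
#   az keyvault show --name "<source-vault>" --query location --output tsv
key_replication_gate westeurope northeurope
key_replication_gate westeurope eastus
```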

When to escalate: Only to request a geography map update for a new region.

`unknown_key_type`¶

What it means: The scanner doesn't yet have a key_type value (RSA, RSA-HSM, EC, etc.) for this key, so the expander can't evaluate the cross-geography gate.

Why it happens: The scanner hasn't run a full cycle since the key was created.

How to fix it

Wait for the next expansion cycle; the expander will auto-retry and populate key_type lazily
If it persists, trigger a fresh discovery scan on the source subscription, then Re-expand rules on the config

When to escalate: If unknown_key_type persists after two scan-plus-expand cycles, open a ticket with the correlation ID.

`throttled`¶

What it means: Azure returned HTTP 429 (rate limited). The engine automatically retries with exponential backoff, respecting the Retry-After header.

Why it happens: High activity against the same vault, such as bulk replication, simultaneous discovery scans, or other tenants sharing Azure's per-vault limits.

How to fix it: Usually nothing. The engine handles it.
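For intuition, the retry delay can be sketched as follows. The base delay, the cap, and the function name are assumptions chosen for illustration; the engine's real values are not documented here. The sketch handles only the delta-seconds form of Retry-After and omits jitter to keep the output deterministic.

```shell
# Prints the number of seconds to wait before the next attempt.
#   $1 = zero-based attempt number
#   $2 = Retry-After value in seconds, or empty if the header was absent
backoff_delay() {
  attempt=$1
  retry_after=$2
  base=1     # assumed base delay in seconds
  cap=60     # assumed upper bound in seconds
  if [ -n "$retry_after" ]; then
    # The server's hint takes precedence over the computed delay.
    echo "$retry_after"
    return
  fi
  # Clamp the exponent so the shift stays small; 2^6 already exceeds the cap.
  if [ "$attempt" -gt 6 ]; then attempt=6; fi
  delay=$((base << attempt))
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

backoff_delay 0 ""    # first retry
backoff_delay 3 ""    # fourth retry
backoff_delay 2 30    # Retry-After: 30
```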

If a record has exhausted retries and surfaced throttled to you:

Reduce parallel activity (pause other bulk operations on the vault)
Increase the sync interval for this config (5m → 1h or 24h)
Manually Retry the record

When to escalate: If throttled appears without a clear traffic spike, or persists across multiple cycles, there may be a quota issue on the vault.

`transient_network`¶

What it means: HTTP 5xx, timeout, or connection error. The engine automatically retries with exponential backoff.

Why it happens: Azure service blip, network hiccup, DNS lag.

How to fix it: Usually nothing. If records persist in transient_network after retries are exhausted, click Retry on the record or wait for the next automatic cycle.

When to escalate: If many records fail with transient_network at once across multiple configs, check the Azure status page for your region. If Azure is green, gather correlation IDs and open a ticket.

`unknown`¶

What it means: An exception the engine didn't recognise. Full exc_info is logged backend-side.

Why it happens: A case the error-mapping layer doesn't cover yet. Treat as a bug in CertifyClouds, not in your configuration.

How to fix it

Grab the correlation ID from the record's history entry or the X-Correlation-Id response header
Open a support ticket with:
- Correlation ID
- Config ID
- Record ID
- Rough timestamp of the failure
While waiting for a fix, if the record is critical, try Retry: intermittent unknowns sometimes clear on the next attempt.

When to escalate: Every unknown should be reported. This code exists specifically to surface gaps in our error mapping.
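The status-to-code rules stated in the sections above can be collected into one function, which also shows where `unknown` comes from: it is whatever falls through. This is a reading of this page, not the engine's actual mapping layer. `classify_error` and the operation labels are hypothetical, and the conflict codes (`target_soft_deleted`, `target_conflict_active`) are left out because this page does not say which responses trigger them.

```shell
# Maps a failed call to an error code.
#   $1 = side:      source | target
#   $2 = status:    HTTP status, or "timeout" / "connection_error"
#   $3 = operation: read | write | key_backup | key_restore
classify_error() {
  side=$1 status=$2 op=$3
  case "$status" in
    429)                          echo "throttled"; return ;;
    5??|timeout|connection_error) echo "transient_network"; return ;;
  esac
  case "$side:$status" in
    source:401) echo "source_auth_failed" ;;
    source:403)
      if [ "$op" = "key_backup" ]; then echo "rbac_backup_key"
      else echo "source_auth_failed"; fi ;;
    source:404) echo "source_not_found" ;;
    target:401) echo "target_auth_failed" ;;
    target:403)
      if [ "$op" = "key_restore" ]; then echo "rbac_restore_key"
      else echo "target_auth_failed"; fi ;;
    *) echo "unknown" ;;   # anything the rules above do not cover
  esac
}

classify_error source 403 key_backup
classify_error target 503 write
```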

General Tips¶

Correlation IDs¶

Every sync run generates a correlation ID. Quote it in any support ticket: it lets us pull the exact run from logs, audit events, and WebSocket streams in one query.

Find it in:

The Record history entries shown in the UI for each replication record
The X-Correlation-Id response header on any replication API response
Backend log lines for that run
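When calling the API from a script, the header can be pulled out of the raw response headers. A minimal sketch: `correlation_id` is a hypothetical helper that reads HTTP headers on standard input, and the URL in the usage comment is a placeholder for whichever replication endpoint you are calling.

```shell
# Reads raw HTTP response headers on stdin and prints the value of the first
# X-Correlation-Id header. Header names are matched case-insensitively.
correlation_id() {
  tr -d '\r' | awk '
    tolower($0) ~ /^x-correlation-id:/ {
      sub(/^[^:]*:[ \t]*/, "")   # drop the header name and leading whitespace
      print
      exit
    }'
}

# Example wiring (not executed here; REPLICATION_API_URL is a placeholder):
#   curl -sS -D - -o /dev/null "$REPLICATION_API_URL" | correlation_id
```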

Alert Flooding¶

Repeated failures on the same record are de-duplicated in the audit log to prevent a single stuck record from flooding alerts. Retry counts and error counts continue to be tracked.
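The effect of de-duplication can be illustrated with a short pipeline: repeated failures collapse to one entry per record and error code, while the count is kept. This is only an illustration of the idea, not how the audit log is implemented; `dedupe_failures` and the input format (one `record_id error_code` pair per line) are assumptions.

```shell
# Reads "record_id error_code" lines on stdin and prints one line per
# distinct pair in the form "record_id error_code count".
dedupe_failures() {
  sort | uniq -c | awk '{ print $2, $3, $1 }'
}

printf '%s\n' 'r1 throttled' 'r1 throttled' 'r2 unknown' 'r1 throttled' | dedupe_failures
```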

If you're not seeing alerts you expect, confirm alerts.notify_on_failure is enabled on the config's alerts settings.

Re-expansion Timeout¶

If a config saves but expansion times out, the response is marked truncated: true with expansion_state: 'partial'. The config is still persisted, and the next worker cycle resumes expansion from the last checkpointed batch (last_batch_succeeded_at). No data is lost, and a banner in the UI tells you expansion is still running.
