English {#english}

Symptoms: a job shows FAILED in the Control Plane dashboard or in D1; a provisioning failure alert arrives (e.g. via Slack).

Checks: D1 trace_events and provision_jobs; Prometheus queue backlog; Ansible/Pulumi and workflow logs (GET /trace/:trace_id).

Actions: transient error → Retry or re-run workflow_dispatch; DB capacity exhausted → assign a new DB host and re-run Ansible; unrecoverable → run the compensation transaction and document the failure.

Ref: api-control-plane-implementation-plan §19.1, provision-tenant-pipeline, OPERATIONS.

flowchart TD
  A[FAILED job] --> B[Check D1 / logs]
  B --> C{Transient?}
  C -->|Yes| D[Retry]
  C -->|No| E{DB capacity?}
  E -->|Yes| F[New host + Ansible re-run]
  E -->|No| G[Compensation + document]
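
The decision order in the flowchart can be sketched in a few lines of Python. This is only an illustration: the names `Action` and `triage` are hypothetical, and whether a failure counts as transient or capacity-related remains an operator judgment made from the D1 records and logs.

```python
from enum import Enum


class Action(Enum):
    """The three outcomes of the flowchart (names are illustrative)."""
    RETRY = "retry or re-run workflow_dispatch"
    NEW_DB_HOST = "assign a new DB host and re-run Ansible"
    COMPENSATE = "run the compensation transaction and document the failure"


def triage(is_transient: bool, is_db_capacity: bool) -> Action:
    """Walk the flowchart: transient first, then DB capacity, else compensate."""
    if is_transient:
        return Action.RETRY
    if is_db_capacity:
        return Action.NEW_DB_HOST
    return Action.COMPENSATE


if __name__ == "__main__":
    # A failure that is neither transient nor capacity-related.
    print(triage(is_transient=False, is_db_capacity=False).value)
```

The transient check comes first, so a failure flagged as both transient and capacity-related is retried before any new host is assigned.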

Full details: see Korean section below.

Korean {#korean}

Runbook: Tenant Creation Failure (Provisioning Failure)

Plan document: api-control-plane-implementation-plan §19.1

Symptoms

The job shows FAILED status in the Control Plane dashboard (or in D1/Grafana)
A provisioning failure alert is received via Slack or a similar channel

Checks

D1: check the trace_events and provision_jobs records for the affected tenant_id and job_id
Prometheus/Grafana: check whether queue_backlog_size (or the number of Pending provision_jobs) has spiked
Ansible/Pulumi: find the failed step and its error message in the GitHub Actions or workflow logs (LogPath: GET /trace/:trace_id)
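
The D1 checks can be rehearsed locally, since D1 is SQLite-compatible. The sketch below uses an in-memory `sqlite3` database as a stand-in for D1. Only the table names, `tenant_id`, `job_id`, and the FAILED/Pending states come from this runbook; the `status`, `step`, `message`, and `id` columns, and all helper names, are assumptions for illustration, so adjust them to the real schema.

```python
import sqlite3
from urllib.parse import quote


def pending_count(conn: sqlite3.Connection) -> int:
    """Count provision_jobs rows still Pending (a proxy for queue backlog)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM provision_jobs WHERE status = 'Pending'"
    ).fetchone()
    return row[0]


def events_for_job(conn: sqlite3.Connection, tenant_id: str, job_id: str):
    """Return (step, message) rows for one job, oldest first.

    The step/message/id columns are assumed, not taken from the real schema.
    """
    return conn.execute(
        "SELECT step, message FROM trace_events "
        "WHERE tenant_id = ? AND job_id = ? ORDER BY id",
        (tenant_id, job_id),
    ).fetchall()


def trace_path(trace_id: str) -> str:
    """Build the request path for GET /trace/:trace_id."""
    return "/trace/" + quote(trace_id, safe="")


def demo_db() -> sqlite3.Connection:
    """In-memory stand-in for D1 with a hypothetical schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE provision_jobs (
            job_id TEXT PRIMARY KEY, tenant_id TEXT, status TEXT);
        CREATE TABLE trace_events (
            id INTEGER PRIMARY KEY, tenant_id TEXT, job_id TEXT,
            step TEXT, message TEXT);
        INSERT INTO provision_jobs VALUES
            ('j1', 't1', 'FAILED'), ('j2', 't2', 'Pending'), ('j3', 't3', 'Pending');
        INSERT INTO trace_events (tenant_id, job_id, step, message) VALUES
            ('t1', 'j1', 'pulumi', 'ok'),
            ('t1', 'j1', 'ansible', 'connection timed out');
        """
    )
    return conn


if __name__ == "__main__":
    conn = demo_db()
    print("pending jobs:", pending_count(conn))
    for step, message in events_for_job(conn, "t1", "j1"):
        print(step, "->", message)
    print(trace_path("abc/1"))
```

The queries are parameterized, so a tenant_id or job_id copied from an alert can be passed through unchanged.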

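Actions

Diagnosis	Action
Transient error	Use the Retry button, or re-run workflow_dispatch with the same job_id
DB server out of capacity	Assign a new DB host, then re-run Ansible for the affected tenant (or use the plan-upgrade queue)
Unrecoverable	Manually run the compensation transaction to clean up resources that were already created, then inform the tenant of the failure reason and next steps. See plan §14.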
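
For the transient case, a workflow_dispatch re-run can also be triggered through the GitHub REST API endpoint `POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches`. The sketch below only builds the request and does not send it. The owner, repository, workflow file name, and the `job_id` input are placeholders, and the input works only if the workflow declares it under `workflow_dispatch.inputs`.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner: str, repo: str, workflow: str,
                           ref: str, job_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a workflow_dispatch request for one job_id.

    `workflow` is the workflow file name or numeric ID. The `job_id` input
    name is an assumption about how the pipeline's workflow is declared.
    """
    url = f"{GITHUB_API}/repos/{owner}/{repo}/actions/workflows/{workflow}/dispatches"
    body = json.dumps({"ref": ref, "inputs": {"job_id": job_id}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # All argument values here are placeholders, not real repository details.
    req = build_dispatch_request(
        "example-org", "example-repo", "provision-tenant.yml",
        ref="main", job_id="j1", token="<token>",
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Reusing the original job_id, as the table suggests, keeps the re-run tied to the same D1 records instead of creating a second job for the tenant.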

References: provision-tenant-pipeline, OPERATIONS, api-control-plane-implementation-plan §19.3 Runbook index.