Observability
1. Observability Goals
Observability exists to answer:
- What is happening right now?
- Why did it fail?
- Who is impacted (tenant/user)?
- Is it getting worse?
In TestoQA, observability must be:
- tenant-aware (support per-project debugging)
- secure (no secrets or cross-tenant leakage)
- actionable (supports alerts and triage)
2. Logging Strategy
What we log
- requests entering boundaries (high-level)
- authorization outcomes (allowed/denied), without sensitive subjects
- service workflow milestones
- integration calls (timing + status, not payload)
- errors with classification (see error-handling.mdx)
How we log
- structured logs (key/value)
- consistent fields across all server entry points
Recommended common fields:
- requestId (correlation id)
- userId (when authenticated)
- projectId (when tenant-resolved)
- route or action
- module/layer (boundary/service/repo)
- durationMs
- result (success/failure)
- errorType (when applicable)
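To make the recommended fields concrete, the following is a minimal sketch of a structured log entry. The type name `LogEntry`, the helper `formatLogLine`, and the extra `timestamp` field are assumptions for illustration, not part of the TestoQA codebase.

```typescript
// Hypothetical shape built from the recommended common fields.
type Layer = "boundary" | "service" | "repo";

interface LogEntry {
  requestId: string;            // correlation id
  userId?: string;              // only when authenticated
  projectId?: string;           // only when tenant-resolved
  route?: string;               // route or action, whichever applies
  action?: string;
  layer: Layer;                 // module/layer that emitted the entry
  durationMs: number;
  result: "success" | "failure";
  errorType?: string;           // only when applicable
}

// Serialize one entry as a single JSON line.
// JSON.stringify omits properties whose value is undefined,
// so optional fields simply disappear when they are not set.
// The `timestamp` field is an assumption added for readability.
function formatLogLine(entry: LogEntry, now: Date = new Date()): string {
  return JSON.stringify({ timestamp: now.toISOString(), ...entry });
}

// Example usage with made-up values.
console.log(
  formatLogLine({
    requestId: "req-123",
    projectId: "proj-42",
    action: "createTestRun",
    layer: "service",
    durationMs: 87,
    result: "success",
  }),
);
```

Keeping the shape in one shared type is one way to get consistent fields across all server entry points.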
Redaction rules
Never log:
- session tokens/cookies
- secrets/env vars
- raw upload contents
- sensitive artifacts/prompt-like payloads (if applicable)
Prefer:
- hashes
- counts
- sizes
- opaque IDs
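The redaction rules above can be sketched as a small filter applied before anything is logged. The key lists, the `redactForLog` helper, and the 12-character hash prefix are illustrative assumptions; only `createHash` from Node's `node:crypto` module is a real API here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key lists; a real deployment would maintain its own.
const NEVER_LOG = new Set(["cookie", "authorization", "sessiontoken", "password", "secret"]);
const HASH_INSTEAD = new Set(["email", "filename"]);

// Short SHA-256 prefix: lets operators match repeated values without seeing them.
function shortHash(value: string): string {
  return createHash("sha256").update(value).digest("hex").slice(0, 12);
}

// Returns a copy of `fields` that follows the "never log / prefer" rules.
// Only top-level keys are inspected; nested objects pass through unchanged.
function redactForLog(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    const lower = key.toLowerCase();
    if (NEVER_LOG.has(lower)) {
      safe[key] = "[REDACTED]";
    } else if (HASH_INSTEAD.has(lower)) {
      safe[`${key}Hash`] = shortHash(String(value)); // hash instead of raw value
    } else if (value instanceof Uint8Array) {
      safe[`${key}Bytes`] = value.byteLength;        // size instead of raw upload contents
    } else if (Array.isArray(value)) {
      safe[`${key}Count`] = value.length;            // count instead of full list
    } else {
      safe[key] = value;                             // opaque IDs and scalars pass through
    }
  }
  return safe;
}
```

A deny-list like this is only a backstop: it does not recurse into nested objects, and hashes of low-entropy values can be guessed, so the safer habit is still to avoid passing sensitive values to the logger at all.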
3. Metrics and KPIs
Metrics should allow tracking system health and identifying regressions.
Core platform metrics
- request rate
- request latency (p50/p95/p99)
- error rate (by category)
- throughput (actions/handlers)
- saturation signals (CPU/memory)
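As a reminder of what p50/p95/p99 mean, here is a small sketch that computes latency percentiles from raw samples using the nearest-rank method. The `percentile` helper is hypothetical; a production metrics backend would typically derive percentiles from histograms rather than from stored samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // copy, then sort ascending
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Example usage with made-up request latencies in milliseconds.
const latenciesMs = [120, 15, 30, 45, 20, 900, 25, 35, 40, 50];
console.log({
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
});
```

The example shows why tail percentiles matter: a single slow request barely moves the median but dominates p95 and p99.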
Database metrics
- query latency
- connection pool usage
- slow queries
- transaction duration
- lock contention
Domain metrics (recommended)
- test runs started/completed/failed
- execution throughput
- report generation duration
- upload size distribution
- realtime publish rates
Tenant-awareness
Where appropriate, metrics should be taggable by projectId, but avoid unbounded tag cardinality when the project count is large. Use tenant tagging selectively for:
- error investigation
- high-value workflows
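One way to keep tenant tagging selective is an explicit allow-list of metrics that may carry a projectId tag. The sketch below assumes hypothetical metric names and a hypothetical `buildTags` helper; it is not tied to any particular metrics library.

```typescript
type Tags = Record<string, string>;

// Hypothetical allow-list: only these metrics are tagged per project.
const TENANT_TAGGED_METRICS = new Set([
  "test_run.failed",               // error investigation
  "report.generation.duration_ms", // high-value workflow
]);

// Adds projectId only for allow-listed metrics, so high-volume metrics
// such as request counts do not grow one time series per project.
function buildTags(metric: string, base: Tags, projectId?: string): Tags {
  if (projectId !== undefined && TENANT_TAGGED_METRICS.has(metric)) {
    return { ...base, projectId };
  }
  return { ...base };
}

// Example usage with made-up values.
console.log(buildTags("test_run.failed", { env: "prod" }, "proj-42"));
console.log(buildTags("http.request.count", { env: "prod" }, "proj-42"));
```

Centralizing the decision in one helper makes it easy to review which metrics are tenant-tagged and to remove a tag if cardinality becomes a problem.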
4. Tracing and Correlation
Even without full distributed tracing, correlation is required.
Correlation identifiers
- Every boundary invocation should have a requestId.
- That requestId should be propagated through:
  - services
  - repositories
  - integration calls
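The propagation described above can be sketched with an explicit context object passed from the boundary down to the repository. All names here (`RequestContext`, `handleGetRun`, `getRunSummary`, `findRunById`, the in-memory `logLines` sink) are hypothetical, and the functions are synchronous only to keep the sketch short; the same parameter passing applies to async code.

```typescript
import { randomUUID } from "node:crypto";

interface RequestContext {
  readonly requestId: string;
  readonly userId?: string;
  readonly projectId?: string;
}

// In-memory sink so the sketch is self-contained; real code would use the app logger.
const logLines: Array<{ requestId: string; layer: string; message: string }> = [];

function log(ctx: RequestContext, layer: string, message: string): void {
  logLines.push({ requestId: ctx.requestId, layer, message });
}

// Repository layer: receives the context, never creates its own id.
function findRunById(ctx: RequestContext, runId: string): { id: string; status: string } {
  log(ctx, "repo", `load run ${runId}`);
  return { id: runId, status: "completed" }; // stand-in for a database read
}

// Service layer: forwards the same context unchanged.
function getRunSummary(ctx: RequestContext, runId: string): string {
  log(ctx, "service", "getRunSummary");
  const run = findRunById(ctx, runId);
  return `${run.id}:${run.status}`;
}

// Boundary: the only place a requestId is created (or accepted from upstream).
function handleGetRun(runId: string, incomingRequestId?: string) {
  const ctx: RequestContext = { requestId: incomingRequestId ?? randomUUID() };
  log(ctx, "boundary", "get run");
  return { requestId: ctx.requestId, summary: getRunSummary(ctx, runId) };
}
```

Returning the requestId to the caller is what allows a UI-visible error to be matched against server logs. In Node.js, `AsyncLocalStorage` from `node:async_hooks` is an alternative that avoids threading the context through every signature, at the cost of making the dependency implicit.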
What correlation enables
- linking UI-visible errors to server logs
- measuring end-to-end latency for workflows
- debugging concurrency issues
If/when distributed tracing is introduced, it should:
- preserve tenant safety
- avoid sensitive payload capture
5. Audit Logging (Multi-Tenant)
Some actions should be auditable (who did what, when, in which project).
Audit-worthy actions (examples)
- membership/role changes
- deletion of test cases/runs
- configuration changes
- upload access changes
- actions that affect billing/limits (if any)
Audit log requirements
- include userId, projectId, timestamp
- include action type and target identifiers
- exclude sensitive payload content
- be immutable or append-only where feasible
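A minimal sketch of an audit record that follows these requirements is shown below. The `AuditEvent` shape, the action naming, and the in-memory `AppendOnlyAuditLog` class are assumptions for illustration; a real system would persist events in durable, append-only storage.

```typescript
interface AuditEvent {
  readonly userId: string;
  readonly projectId: string;
  readonly timestamp: string;  // ISO 8601
  readonly action: string;     // action type, e.g. "membership.role_changed"
  readonly targetType: string; // kind of target, e.g. "member"
  readonly targetId: string;   // target identifier, never the payload itself
}

class AppendOnlyAuditLog {
  private readonly events: AuditEvent[] = [];

  // The only write operation: there is deliberately no update or delete.
  append(event: Omit<AuditEvent, "timestamp">, now: Date = new Date()): AuditEvent {
    // Copy known fields explicitly so stray payload properties are not stored.
    const record: AuditEvent = Object.freeze({
      userId: event.userId,
      projectId: event.projectId,
      timestamp: now.toISOString(),
      action: event.action,
      targetType: event.targetType,
      targetId: event.targetId,
    });
    this.events.push(record);
    return record;
  }

  // Reads are always scoped to one project to avoid cross-tenant leakage.
  forProject(projectId: string): readonly AuditEvent[] {
    return this.events.filter((e) => e.projectId === projectId);
  }
}
```

Freezing each record and exposing only `append` and a project-scoped read approximates immutability in application code; the storage layer would still need its own protections.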
6. Alerting Principles
Alerts should be:
- actionable (someone can do something)
- low-noise
- aligned to user impact
Recommended alerts
- elevated 5xx error rate
- elevated auth/tenant-resolution failures (may indicate incident or abuse)
- database connection exhaustion
- sustained high latency
- integration provider failures (storage/realtime/email)
Tenant-specific incidents
Maintain a playbook that distinguishes:
- “single project affected” from “system-wide” incidents
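As a first triage step, per-project error counts can hint at whether an incident is confined to one project. The sketch below uses a simple dominance heuristic; the `classifyIncidentScope` helper and the 0.9 threshold are assumptions, not an established TestoQA rule.

```typescript
type IncidentScope = "none" | "single-project" | "system-wide";

// If one project accounts for at least `dominanceThreshold` of all errors
// in the window, treat the incident as confined to that project.
function classifyIncidentScope(
  errorsByProject: Record<string, number>,
  dominanceThreshold = 0.9,
): IncidentScope {
  const counts = Object.values(errorsByProject);
  const total = counts.reduce((sum, c) => sum + c, 0);
  if (total === 0) return "none";
  const largest = Math.max(...counts);
  return largest / total >= dominanceThreshold ? "single-project" : "system-wide";
}

// Example usage with made-up counts from one alert window.
console.log(classifyIncidentScope({ "proj-1": 95, "proj-2": 5 }));
console.log(classifyIncidentScope({ "proj-1": 40, "proj-2": 30, "proj-3": 30 }));
```

This only works if error logs or selected metrics carry a projectId, which is one reason tenant-aware fields matter during triage.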
7. Operational Readiness Checklist
- Request correlation (requestId) exists and is logged
- Logs are structured and redacted
- Metrics cover latency, errors, DB health
- Alerts exist for high-impact failures
- Audit logging exists for sensitive actions
- Cross-tenant data cannot leak through logs or telemetry