Observability

1. Observability Goals

Observability exists to answer:

What is happening right now?
Why did it fail?
Who is impacted (tenant/user)?
Is it getting worse?

In TestoQA, observability must be:

tenant-aware (support per-project debugging)
secure (no secrets or cross-tenant leakage)
actionable (supports alerts and triage)

2. Logging Strategy

What we log

requests entering boundaries (high-level)
authorization outcomes (allowed/denied), without sensitive subjects
service workflow milestones
integration calls (timing + status, not payload)
errors with classification (see error-handling.mdx)

How we log

structured logs (key/value)
consistent fields across all server entry points

Recommended common fields:

requestId (correlation id)
userId (when authenticated)
projectId (when tenant-resolved)
route or action
module / layer (boundary/service/repo)
durationMs
result (success/failure)
errorType (when applicable)
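
As a rough illustration of the common fields above, one structured log line might be produced as follows. This is a minimal sketch, not TestoQA's actual logger: the `LogEntry` type, the `formatLogLine` helper, and the sample values are all hypothetical.

```typescript
// Hypothetical shape mirroring the recommended common fields.
type Layer = "boundary" | "service" | "repo";

interface LogEntry {
  requestId: string;
  userId?: string; // set only when authenticated
  projectId?: string; // set only when the tenant has been resolved
  route: string; // route or action name
  layer: Layer;
  durationMs: number;
  result: "success" | "failure";
  errorType?: string; // set only when applicable
}

// Serialize one entry as a single JSON line.
// JSON.stringify omits fields whose value is undefined.
function formatLogLine(entry: LogEntry, now: Date = new Date()): string {
  return JSON.stringify({ timestamp: now.toISOString(), ...entry });
}

// Example: an unauthenticated-but-tenant-resolved boundary call (sample values).
const line = formatLogLine({
  requestId: "req-123",
  projectId: "proj-42",
  route: "testRuns.create",
  layer: "boundary",
  durationMs: 87,
  result: "success",
});
console.log(line);
```

Keeping the field set in one shared type is one way to get consistent fields across all server entry points.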

Redaction rules

Never log:

session tokens/cookies
secrets/env vars
raw upload contents
sensitive artifacts/prompt-like payloads (if applicable)

Prefer:

hashes
counts
sizes
opaque IDs
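
The redaction rules above could be enforced at the logging boundary with a small filter. The sketch below assumes a Node.js runtime; the deny-list in `SENSITIVE_KEYS`, the `MAX_STRING_LENGTH` cutoff, and the `redactFields` helper are hypothetical and would need to match the fields the application really handles.

```typescript
import { createHash } from "node:crypto";

// Hypothetical deny-list; the real set of sensitive keys is project-specific.
const SENSITIVE_KEYS = new Set(["cookie", "authorization", "sessiontoken", "password", "secret"]);
// Hypothetical cutoff above which a string is treated as a payload, not a label.
const MAX_STRING_LENGTH = 256;

type Loggable = string | number | boolean | null | undefined | Uint8Array;
type Redacted = Record<string, string | number | boolean | null>;

// Truncated SHA-256: useful for correlating non-secret values,
// not a way to hide low-entropy secrets (those are dropped entirely below).
function fingerprint(value: string): string {
  return createHash("sha256").update(value).digest("hex").slice(0, 12);
}

function redactFields(fields: Record<string, Loggable>): Redacted {
  const out: Redacted = {};
  for (const [key, value] of Object.entries(fields)) {
    if (value === undefined) continue;
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = "[REDACTED]"; // tokens, cookies, secrets: never logged
    } else if (value instanceof Uint8Array) {
      out[`${key}Bytes`] = value.byteLength; // raw contents: log the size only
    } else if (typeof value === "string" && value.length > MAX_STRING_LENGTH) {
      out[`${key}Sha256`] = fingerprint(value); // large text: hash plus length
      out[`${key}Length`] = value.length;
    } else {
      out[key] = value;
    }
  }
  return out;
}

console.log(redactFields({ cookie: "sid=abc", upload: new Uint8Array(2048), route: "uploads.create" }));
```

A key-based deny-list only catches fields with known names, so it complements rather than replaces care about what is passed to the logger in the first place.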

3. Metrics and KPIs

Metrics should make it possible to track system health and identify regressions.

Core platform metrics

request rate
request latency (p50/p95/p99)
error rate (by category)
throughput (actions/handlers)
saturation signals (CPU/memory)
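
To make the p50/p95/p99 latency figures concrete, here is a minimal sketch of the nearest-rank percentile over raw samples. The `percentile` helper and the sample latencies are hypothetical; metrics backends typically derive percentiles from histogram buckets rather than raw samples, so this only illustrates what the numbers mean.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, copy first
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical request latencies in milliseconds.
const latenciesMs = [12, 15, 20, 22, 30, 45, 50, 80, 120, 400];

console.log({
  p50: percentile(latenciesMs, 50), // 30
  p95: percentile(latenciesMs, 95), // 400
  p99: percentile(latenciesMs, 99), // 400
});
```

With only ten samples, a single slow request determines both p95 and p99, which is why tail percentiles are only meaningful over a reasonably large window.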

Database metrics

query latency
connection pool usage
slow queries
transaction duration
lock contention

Domain metrics (recommended)

test runs started/completed/failed
execution throughput
report generation duration
upload size distribution
realtime publish rates

Tenant-awareness

Where appropriate, metrics should be taggable by projectId, but avoid unbounded label cardinality when the project count is large. Use tenant tagging selectively for:

error investigation
high-value workflows
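
One way to keep tenant tagging selective is an explicit allow-list of metrics that may carry projectId. The sketch below is an assumption about how that could look: the metric names in `TENANT_TAGGED_METRICS` and the `metricTags` helper are hypothetical.

```typescript
// Hypothetical allow-list: only these metrics are worth the per-project cardinality.
const TENANT_TAGGED_METRICS = new Set(["test_run_failed", "report_generation_duration"]);

type Tags = Record<string, string>;

// Attach projectId only for allow-listed metrics; everything else stays tenant-agnostic.
function metricTags(metric: string, projectId: string | undefined, base: Tags = {}): Tags {
  if (projectId !== undefined && TENANT_TAGGED_METRICS.has(metric)) {
    return { ...base, projectId };
  }
  return { ...base };
}

console.log(metricTags("request_latency", "proj-42", { route: "testRuns.create" }));
console.log(metricTags("test_run_failed", "proj-42"));
```

The first call drops projectId and keeps only the route tag; the second keeps projectId because the metric is on the allow-list.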

4. Tracing and Correlation

Even without full distributed tracing, correlation is required.

Correlation identifiers

Every boundary invocation should have a requestId.
That requestId should be propagated through:
- services
- repositories
- integration calls
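
A minimal sketch of this propagation, assuming the server runs on Node.js, uses `AsyncLocalStorage` so that inner layers can read the requestId without it being passed as a parameter. The `withRequestContext` and `currentRequestId` helpers and the stand-in service and repository functions are hypothetical.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext {
  requestId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Boundary: assign the requestId once, then run the handler inside that context.
function withRequestContext<T>(handler: () => Promise<T>, requestId: string = randomUUID()): Promise<T> {
  return requestContext.run({ requestId }, handler);
}

// Any layer: read the current requestId (undefined outside a request).
function currentRequestId(): string | undefined {
  return requestContext.getStore()?.requestId;
}

// Hypothetical repository and service functions standing in for real layers.
async function findRunRepo(runId: string): Promise<string> {
  return `[${currentRequestId()}] repo.findRun ${runId}`;
}

async function loadRunService(runId: string): Promise<string[]> {
  const repoLine = await findRunRepo(runId);
  return [`[${currentRequestId()}] service.loadRun ${runId}`, repoLine];
}

withRequestContext(() => loadRunService("run-1"), "req-abc").then((lines) => {
  for (const l of lines) console.log(l);
});
```

Both log lines carry the same `req-abc` prefix even though neither function receives the requestId explicitly. Passing the id as an explicit argument is an equally valid design where async context is unavailable.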

What correlation enables

linking UI-visible errors to server logs
measuring end-to-end latency for workflows
debugging concurrency issues

If/when distributed tracing is introduced, it should:

preserve tenant safety
avoid sensitive payload capture

5. Audit Logging (Multi-Tenant)

Some actions should be auditable (who did what, when, in which project).

Audit-worthy actions (examples)

membership/role changes
deletion of test cases/runs
configuration changes
upload access changes
actions that affect billing/limits (if any)

Audit log requirements

include userId, projectId, timestamp
include action type and target identifiers
exclude sensitive payload content
be immutable or append-only where feasible
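
The requirements above can be sketched as an append-only record type. This is an in-memory illustration under stated assumptions, not a storage design: the `AuditEvent` shape, the `InMemoryAuditLog` class, and the action names are hypothetical, and a real implementation would persist events to durable storage.

```typescript
interface AuditEvent {
  readonly userId: string;
  readonly projectId: string;
  readonly timestamp: string; // ISO 8601
  readonly action: string; // action type, e.g. "membership.role_changed"
  readonly targetId: string; // identifier of the affected entity, never its content
}

class InMemoryAuditLog {
  private readonly events: AuditEvent[] = [];

  // The only write operation: there is deliberately no update or delete.
  append(event: Omit<AuditEvent, "timestamp">, now: Date = new Date()): AuditEvent {
    const stored: AuditEvent = Object.freeze({ ...event, timestamp: now.toISOString() });
    this.events.push(stored);
    return stored;
  }

  // Reads are always scoped to one project to avoid cross-tenant leakage.
  forProject(projectId: string): readonly AuditEvent[] {
    return this.events.filter((e) => e.projectId === projectId);
  }
}

const auditLog = new InMemoryAuditLog();
auditLog.append({ userId: "u-1", projectId: "proj-42", action: "membership.role_changed", targetId: "u-7" });
auditLog.append({ userId: "u-2", projectId: "proj-99", action: "test_case.deleted", targetId: "tc-15" });
console.log(auditLog.forProject("proj-42"));
```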

6. Alerting Principles

Alerts should be:

actionable (someone can do something)
low-noise
aligned to user impact

Recommended alerts

elevated 5xx error rate
elevated auth/tenant-resolution failures (may indicate incident or abuse)
database connection exhaustion
sustained high latency
integration provider failures (storage/realtime/email)

Tenant-specific incidents

The incident playbook should distinguish between two scopes:

“single project affected”
“system-wide”
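
As a first triage step, recent error counts grouped by projectId can hint at which scope applies. The heuristic below is purely illustrative: the `classifyScope` helper and the 80% dominance threshold are assumptions, not an established rule.

```typescript
type IncidentScope = "none" | "single-project" | "system-wide";

// Hypothetical heuristic: if one project accounts for at least `dominance`
// of recent errors, treat the incident as scoped to that project.
function classifyScope(errorsByProject: Record<string, number>, dominance = 0.8): IncidentScope {
  const counts = Object.values(errorsByProject);
  const total = counts.reduce((sum, n) => sum + n, 0);
  if (total === 0) return "none";
  const max = Math.max(...counts);
  return max / total >= dominance ? "single-project" : "system-wide";
}

console.log(classifyScope({ "proj-42": 95, "proj-7": 3, "proj-9": 2 })); // single-project
console.log(classifyScope({ "proj-42": 30, "proj-7": 35, "proj-9": 35 })); // system-wide
```

A result like this is only a starting signal for the on-call engineer; a single very active project can dominate error counts during a system-wide incident too.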

7. Operational Readiness Checklist

Request correlation (requestId) exists and is logged
Logs are structured and redacted
Metrics cover latency, errors, DB health
Alerts exist for high-impact failures
Audit logging exists for sensitive actions
Logs and telemetry cannot leak data across tenants