
Observability

1. Observability Goals

Observability exists to answer:

  • What is happening right now?
  • Why did it fail?
  • Who is impacted (tenant/user)?
  • Is it getting worse?

In TestoQA, observability must be:

  • tenant-aware (support per-project debugging)
  • secure (no secrets or cross-tenant leakage)
  • actionable (supports alerts and triage)

2. Logging Strategy

What we log

  • requests entering boundaries (high-level)
  • authorization outcomes (allowed/denied), without sensitive subjects
  • service workflow milestones
  • integration calls (timing + status, not payload)
  • errors with classification (see error-handling.mdx)

How we log

  • structured logs (key/value)
  • consistent fields across all server entry points

Recommended common fields:

  • requestId (correlation id)
  • userId (when authenticated)
  • projectId (when tenant-resolved)
  • route or action
  • module / layer (boundary/service/repo)
  • durationMs
  • result (success/failure)
  • errorType (when applicable)
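The recommended fields above can be sketched as a small structured-logging helper. This is a minimal illustration, not TestoQA's actual logging API: `LogEntry` and `logEvent` are hypothetical names, and the field set simply mirrors this list.

```typescript
// Hypothetical structured-log helper emitting one JSON object per line.
// Field names follow the recommended common fields in this document.
interface LogEntry {
  requestId: string;                         // correlation id
  userId?: string;                           // when authenticated
  projectId?: string;                        // when tenant-resolved
  route: string;                             // route or action
  layer: "boundary" | "service" | "repo";    // module / layer
  durationMs: number;
  result: "success" | "failure";
  errorType?: string;                        // when applicable
}

function logEvent(entry: LogEntry): string {
  // One JSON object per line keeps logs machine-parseable and greppable.
  const line = JSON.stringify({ ts: new Date().toISOString(), ...entry });
  console.log(line);
  return line;
}

const line = logEvent({
  requestId: "req-123",
  projectId: "proj-42",
  route: "createTestRun",
  layer: "service",
  durationMs: 18,
  result: "success",
});
```

Keeping the same field names at every server entry point is what makes cross-layer queries ("all failures for projectId X") possible later.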

Redaction rules

Never log:

  • session tokens/cookies
  • secrets/env vars
  • raw upload contents
  • sensitive artifacts/prompt-like payloads (if applicable)

Prefer:

  • hashes
  • counts
  • sizes
  • opaque IDs

3. Metrics and KPIs

Metrics should allow tracking system health and identifying regressions.

Core platform metrics

  • request rate
  • request latency (p50/p95/p99)
  • error rate (by category)
  • throughput (actions/handlers)
  • saturation signals (CPU/memory)

Database metrics

  • query latency
  • connection pool usage
  • slow queries
  • transaction duration
  • lock contention

Product workflow metrics

  • test runs started/completed/failed
  • execution throughput
  • report generation duration
  • upload size distribution
  • realtime publish rates

Tenant-awareness

Where appropriate, metrics should be taggable by projectId, but avoid creating unbounded cardinality when the project count is large. Use tenant tagging selectively for:

  • error investigation
  • high-value workflows
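One way to keep tenant tagging bounded is a cardinality guard: tag with the real projectId only up to a cap, and collapse everything beyond it into a single bucket. This is a sketch under assumed names (`tenantTag`, `MAX_TRACKED_PROJECTS`), not a prescribed implementation.

```typescript
// Cap on distinct projectId tag values to keep metric series bounded.
const MAX_TRACKED_PROJECTS = 100;
const seenProjects = new Set<string>();

function tenantTag(projectId: string): string {
  if (seenProjects.has(projectId)) return projectId;
  if (seenProjects.size < MAX_TRACKED_PROJECTS) {
    seenProjects.add(projectId);
    return projectId;
  }
  // Beyond the cap, collapse into one bucket so total series count
  // stays fixed regardless of how many projects exist.
  return "other";
}
```

An allowlist of "high-value" projects under active investigation would work equally well; the point is that the tag space must not grow with tenant count.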

4. Tracing and Correlation

Even without full distributed tracing, correlation is required.

Correlation identifiers

  • Every boundary invocation should have a requestId.
  • That requestId should be propagated through:
    • services
    • repositories
    • integration calls
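In a Node-based server, this propagation can be sketched with `AsyncLocalStorage`, so the boundary's requestId is visible in services and repositories without threading it through every function signature. `withRequestId` and `currentRequestId` are hypothetical helper names.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Wrap each boundary invocation so everything it calls shares the id.
function withRequestId<T>(requestId: string, fn: () => T): T {
  return requestContext.run({ requestId }, fn);
}

// Any layer (service, repository, integration client) can read it back.
function currentRequestId(): string {
  return requestContext.getStore()?.requestId ?? "unknown";
}

// A repository call several layers down still sees the boundary's id:
const seen = withRequestId("req-789", () => {
  const service = () => {
    const repo = () => currentRequestId();
    return repo();
  };
  return service();
});
```

The same context object is a natural place to carry userId and projectId once the tenant is resolved.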

What correlation enables

  • linking UI-visible errors to server logs
  • measuring end-to-end latency for workflows
  • debugging concurrency issues

If/when distributed tracing is introduced, it should:

  • preserve tenant safety
  • avoid sensitive payload capture

5. Audit Logging (Multi-Tenant)

Some actions should be auditable (who did what, when, in which project).

Audit-worthy actions (examples)

  • membership/role changes
  • deletion of test cases/runs
  • configuration changes
  • upload access changes
  • actions that affect billing/limits (if any)

Audit log requirements

  • include userId, projectId, timestamp
  • include action type and target identifiers
  • exclude sensitive payload content
  • be immutable or append-only where feasible
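The requirements above map to a record shape like the following sketch. `AuditRecord` and `appendAudit` are illustrative names; a real store would append to a database table, but the same properties apply: required identity fields, no payload content, and no mutation after write.

```typescript
interface AuditRecord {
  userId: string;
  projectId: string;
  timestamp: string;   // ISO-8601
  action: string;      // e.g. "member.role_changed"
  targetId: string;    // opaque identifier, never payload content
}

// In-memory stand-in for an append-only store.
const auditLog: AuditRecord[] = [];

function appendAudit(record: AuditRecord): void {
  // Freeze each entry so it cannot be edited after the fact;
  // updating or deleting past entries is deliberately unsupported.
  auditLog.push(Object.freeze({ ...record }));
}

appendAudit({
  userId: "user-1",
  projectId: "proj-42",
  timestamp: new Date().toISOString(),
  action: "member.role_changed",
  targetId: "member-7",
});
```
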

6. Alerting Principles

Alerts should be:

  • actionable (someone can do something)
  • low-noise
  • aligned to user impact

Alert on (examples):

  • elevated 5xx error rate
  • elevated auth/tenant-resolution failures (may indicate incident or abuse)
  • database connection exhaustion
  • sustained high latency
  • integration provider failures (storage/realtime/email)
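The "low-noise, aligned to user impact" principles can be illustrated with a toy error-rate check: alert on the rate of 5xx responses among total requests, and only once there is enough traffic for the rate to be meaningful. The thresholds and the `shouldAlert` name are assumptions for illustration.

```typescript
function shouldAlert(
  total: number,
  errors5xx: number,
  minTraffic = 50,      // ignore near-idle windows (low-noise)
  threshold = 0.05,     // 5% of requests failing (user impact)
): boolean {
  // A single failing request in a quiet window should not page anyone.
  if (total < minTraffic) return false;
  return errors5xx / total >= threshold;
}
```

Rate-based rules like this degrade gracefully as traffic grows, where raw-count thresholds need constant retuning.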

Tenant-specific incidents

Have a playbook posture for:

  • “single project affected” vs “system-wide”

7. Operational Readiness Checklist

  • Request correlation (requestId) exists and is logged
  • Logs are structured and redacted
  • Metrics cover latency, errors, DB health
  • Alerts exist for high-impact failures
  • Audit logging exists for sensitive actions
  • Tenant leaks are not possible through logs/telemetry