OpenTelemetry + Grafana Cloud
Open standards for instrumentation. The backend is a config swap, not a rewrite.
Context
Handler failures were initially logged to console. Invisible in production, impossible to query, no correlation to the request that triggered them. The application needed observability without coupling to a specific vendor. Sentry captures errors well but is proprietary. Datadog is full-stack but expensive. OpenTelemetry is the open standard: instrument once, export anywhere.
The decision
Instrument the command and query buses with OpenTelemetry spans. Each handler dispatch creates a child span within the active HTTP request trace. Handler failures enrich the active span with error attributes. Traces export to Grafana Cloud via the OTLP HTTP exporter. Sampling is environment-aware: 100% in development, configurable (default 10%) in production.
Rationale
- 1
The IObservabilityService interface decouples the application from OpenTelemetry. The bus calls recordHandlerFailure() and doesn't know about spans, exporters, or Grafana. Swapping the implementation is a one-file change.
- 2
Bus-level spans fill the blind spot between the HTTP request span and the response. Without them, 300ms+ of handler execution is invisible in the trace timeline. With them, you see exactly which handler ran and how long it took.
- 3
The OpenTelemetry SDK reads OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS from the environment. No Grafana-specific code in the application. Switching to Datadog, Honeycomb, or Jaeger is an env var change.
- 4
Sampling prevents cost overrun in production. TraceIdRatioBasedSampler drops a configurable percentage of traces before export. Errors within sampled traces are always enriched, not silently dropped.
In the codebase
instrumentation.ts. SDK init with environment-aware sampling
export async function register() {
if (process.env.NEXT_RUNTIME === 'nodejs') {
const { NodeSDK } = await import('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = await import('@opentelemetry/exporter-trace-otlp-http');
const { resourceFromAttributes } = await import('@opentelemetry/resources');
const { ATTR_SERVICE_NAME } = await import('@opentelemetry/semantic-conventions');
const { TraceIdRatioBasedSampler, AlwaysOnSampler } = await import('@opentelemetry/sdk-trace-node');
const isProduction = process.env.NODE_ENV === 'production';
const sampleRate = Number(process.env.OTEL_SAMPLE_RATE ?? (isProduction ? 0.1 : 1.0));
const sampler = sampleRate >= 1.0
? new AlwaysOnSampler()
: new TraceIdRatioBasedSampler(sampleRate);
const sdk = new NodeSDK({
resource: resourceFromAttributes({ [ATTR_SERVICE_NAME]: 'ledger' }),
traceExporter: new OTLPTraceExporter(),
sampler,
});
sdk.start();
}
}Bus dispatch. Handler spans as children of the request trace
// CommandBus.dispatch. Same pattern in QueryBus
async dispatch<T extends AnyCommand>(command: T): Promise<T['_response']> {
const key = command.constructor.name;
const handler = this._handlers.get(key);
return tracer.startActiveSpan(`command.${key}`, async (span) => {
try {
const result = await handler.execute(command);
span.end();
return result;
} catch (error) {
this.observability.recordHandlerFailure(key, error);
span.end();
throw error;
}
});
}ObservabilityService. Enriches the active span, no new span
const ObservabilityService: IObservabilityService = {
recordHandlerFailure(handlerName: string, error: unknown): void {
const span = trace.getActiveSpan();
if (!span) return;
span.setAttribute('handler.name', handlerName);
span.setAttribute('error.type', error instanceof Error ? error.constructor.name : 'UnknownError');
span.setAttribute('error.message', error instanceof Error ? error.message : 'Unknown error');
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
},
};Scaling path. Tail-based sampling via Collector
// Today: head-based sampling (SDK decides before export)
// Limitation: unsampled requests lose their error traces
//
// Production upgrade: OpenTelemetry Collector with tail-based sampling
//
// App (SDK, sample 100%) -> Collector -> evaluates complete traces
// errors: always export to Grafana
// successes: export 10% to Grafana
// rest: drop
//
// App code unchanged. Collector is a deployment concern.
// OTEL_EXPORTER_OTLP_ENDPOINT points to Collector instead of Grafana.Tradeoffs
Open standard, no vendor lock-in. Same instrumentation works with Grafana, Datadog, Jaeger, Zipkin, or any OTLP-compatible backend.
Head-based sampling means some errors in unsampled traces are never exported. True "always capture errors" requires tail-based sampling via an OpenTelemetry Collector, which is added infrastructure.
Bus-level spans give handler-granularity visibility. You see GetBudgetOverviewHandler took 45ms, not just "POST /budgets returned in 200ms."
Every handler dispatch creates a span. At high throughput this adds overhead. Mitigated by sampling, but the instrumentation cost is nonzero.
Grafana Cloud free tier provides 50GB traces. Effectively unlimited for a portfolio project.
Production traffic at scale would exceed the free tier. The sampling rate (OTEL_SAMPLE_RATE env var) controls cost but requires tuning per deployment.