Frontend Observability & RUM

⏱️ ~3-minute bite · solve the sandbox to master


🧒

5-Year-Old Metaphor

— The physical, real-world picture. No jargon.

✈️ Observability = instruments in a cockpit. Without them, you're flying blind.

The cockpit analogy

Real users are flying your plane (app). Without instruments, you don't know altitude (LCP), speed (INP), or if the engine is on fire (error rate). You only find out when the plane crashes (user churns, support tickets spike).

RUM = altitude/speed gauges — continuous readings from real flight conditions. Error tracking = engine warning lights. Session recording = the flight data recorder. Distributed tracing = the full flight path. Alerting = the alarm that wakes the pilot.

Monitoring vs Observability

Monitoring

Collects predefined metrics and alerts when a threshold is breached. Tells you that something is wrong. Answers: "Is it broken?"

Observability

Explore system state from outputs (logs, traces, metrics). Answers arbitrary questions. Tells you why. Answers: "What broke and where?"

🎛️

Interactive Sandbox

— Move something, see it react instantly.

Pattern

Collect Core Web Vitals from real users via web-vitals.js

Live metrics

Metric (P75)	Value	Threshold	Status
LCP	2.1s	≤ 2.5s	OK
INP	180ms	≤ 200ms	OK
CLS	0.08	≤ 0.1	OK
FCP	1.4s	≤ 1.8s	OK
TTFB	620ms	≤ 800ms	OK

Tools: web-vitals.js, Google CrUX, Datadog RUM, Vercel Analytics

⚠️ Gotcha: Lighthouse gives you lab data (simulated, one machine). RUM gives you field data (real users, real devices, real networks). Both matter. Field data is the ground truth — your P75 LCP from CrUX is what Google sees.

💡 Insight: Report P75 (75th percentile), not just the median (P50). Up to half your users could have bad experiences while the median looks fine. Google uses P75 for Core Web Vitals assessment. P75 covers more of the slow tail than the median does; track P95 or P99 as well if you care about the worst experiences.
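To make the P50 versus P75 gap concrete, here is a minimal sketch that computes both from a handful of samples. The sample values are hypothetical, and the nearest-rank method is an assumption chosen for simplicity; CrUX and RUM vendors may aggregate differently.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// the samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical LCP samples in milliseconds: most loads are fast, three are slow.
const lcpSamples = [900, 1000, 1100, 1200, 1300, 3600, 3800, 4200];

const p50 = percentile(lcpSamples, 50); // 1200: looks healthy
const p75 = percentile(lcpSamples, 75); // 3600: over the 2.5s "good" limit
console.log({ p50, p75 });
```

With these samples the median suggests a healthy page, while P75 shows that a sizeable share of loads miss the threshold.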


🎯

Challenge

Visit all 5 observability patterns. Understand RUM, error tracking, session recording, tracing, and alerting.


🎯

Why Should I Care?

— The exact interview question + the bug it kills.

Interview questions

Q: What is the difference between monitoring and observability?

Monitoring: you pre-define what to watch (error rate, LCP P75) and alert when thresholds breach. Good for known failure modes. Observability: your system emits enough data (logs, traces, metrics) that you can ask arbitrary questions after something breaks, even questions you didn't anticipate. Observability handles unknown unknowns. Modern teams need both.

Q: Why are source maps required for meaningful error tracking?

Production bundles are minified: local identifiers are shortened (often to single letters such as a, b, c), whitespace is stripped, and multiple files are concatenated. An error at bundle.js:1:58291 is meaningless without source maps, which map minified positions back to original file names, function names, and line numbers. Upload source maps to Sentry/Datadog on each deploy, and do not serve them publicly (source maps expose your original source code).
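One way to generate maps without advertising them to browsers is sketched below. It assumes a webpack 5 build; the entry point and output paths are hypothetical, and the upload step depends on your error tracker's own tooling.

```javascript
// webpack.config.js (sketch; assumes webpack 5)
const path = require('path');

module.exports = {
  mode: 'production',
  entry: './src/index.js', // hypothetical entry point
  output: {
    path: path.resolve(__dirname, 'dist'), // hypothetical output directory
    filename: '[name].[contenthash].js',
  },
  // Emit full .map files but leave the sourceMappingURL comment out of the
  // bundles, so browsers are not pointed at the maps.
  devtool: 'hidden-source-map',
};
```

The deploy step would then upload the generated .map files to the error tracker and keep them out of the publicly served directory.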

Q: What is a P75 metric and why not P50?

P50 (median) means half of page loads are faster and half are slower. If the fast half is very fast, the median looks fine even though up to half of users may have a poor experience. P75 is the value that 75% of experiences meet or beat, so it reflects users on slower devices and networks. Google assesses Core Web Vitals at P75 because it represents the experience of users who aren't in ideal conditions.

Collect CWV in production

    import { onLCP, onINP, onCLS, onFCP, onTTFB } from 'web-vitals';

    function sendToAnalytics({ name, value, rating }) {
      const body = JSON.stringify({
        metric: name,
        // CLS is a unitless score (e.g. 0.08); rounding it would always report 0.
        value: name === 'CLS' ? value : Math.round(value),
        rating, // 'good' | 'needs-improvement' | 'poor'
        url: location.href,
        // Rough heuristic only: a coarse primary pointer suggests a touch device.
        deviceType: matchMedia('(pointer: coarse)').matches ? 'mobile' : 'desktop',
      });
      // Use sendBeacon so the data is still delivered on page unload.
      navigator.sendBeacon('/analytics', body);
    }

    onLCP(sendToAnalytics);
    onINP(sendToAnalytics);
    onCLS(sendToAnalytics);
    onFCP(sendToAnalytics);
    onTTFB(sendToAnalytics);

🔬

The Deep Dive

— Spec refs, engine internals, the minutiae.

Core Web Vitals (LCP, INP, CLS) and supporting metrics (FCP, TTFB): 2024 thresholds

Metric	Good	Needs Improvement	Poor
LCP (Largest Contentful Paint)	≤ 2.5s	2.5–4s	> 4s
INP (Interaction to Next Paint)	≤ 200ms	200–500ms	> 500ms
CLS (Cumulative Layout Shift)	≤ 0.1	0.1–0.25	> 0.25
FCP (First Contentful Paint)	≤ 1.8s	1.8–3s	> 3s
TTFB (Time to First Byte)	≤ 800ms	800ms–1.8s	> 1.8s
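The table above can be turned into a small classifier, for example to rate values on the server when a client did not send a rating. This is a minimal sketch: THRESHOLDS and rateMetric are illustrative names, times are assumed to be in milliseconds, and boundary values count toward the better bucket, as the ≤ signs in the table suggest.

```javascript
// Thresholds copied from the table: [upper bound of "good", upper bound of
// "needs improvement"]. Times are in milliseconds; CLS is unitless.
const THRESHOLDS = {
  LCP: [2500, 4000],
  INP: [200, 500],
  CLS: [0.1, 0.25],
  FCP: [1800, 3000],
  TTFB: [800, 1800],
};

function rateMetric(name, value) {
  const bounds = THRESHOLDS[name];
  if (!bounds) throw new Error(`unknown metric: ${name}`);
  const [good, needsImprovement] = bounds;
  if (value <= good) return 'good';
  if (value <= needsImprovement) return 'needs-improvement';
  return 'poor';
}

console.log(rateMetric('LCP', 2100)); // good
console.log(rateMetric('INP', 350));  // needs-improvement
console.log(rateMetric('CLS', 0.3));  // poor
```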

OpenTelemetry for frontend

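    import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
    import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
    import { registerInstrumentations } from '@opentelemetry/instrumentation';
    import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

    // Export each finished span over OTLP/HTTP. SimpleSpanProcessor sends spans
    // one at a time; BatchSpanProcessor is the usual choice under production load.
    // Recent SDK versions take processors in the constructor; older 1.x releases
    // used provider.addSpanProcessor(...) instead.
    const provider = new WebTracerProvider({
      spanProcessors: [
        new SimpleSpanProcessor(new OTLPTraceExporter({ url: '/v1/traces' })),
      ],
    });
    provider.register();

    // Auto-instrument: each fetch() now creates a span in the trace and, for
    // same-origin requests, sends a traceparent header. Cross-origin URLs must
    // be allow-listed via the propagateTraceHeaderCorsUrls option.
    registerInstrumentations({
      instrumentations: [new FetchInstrumentation()],
    });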
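The traceparent header mentioned in the comments above follows the W3C Trace Context format: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and a flags byte. The sketch below formats and parses that header by hand to show what each service reads and forwards. The function names are illustrative, and it handles version 00 only; in practice the OpenTelemetry SDK does this for you.

```javascript
// W3C Trace Context, version "00":
//   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(traceId, spanId, sampled) {
  const flags = sampled ? '01' : '00'; // lowest bit of flags = "sampled"
  return `00-${traceId}-${spanId}-${flags}`;
}

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header);
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Example header taken from the W3C Trace Context specification.
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const ctx = parseTraceparent(incoming);

// A downstream service keeps the trace ID and substitutes its own span ID
// (the value here is a hypothetical placeholder).
const outgoing = formatTraceparent(ctx.traceId, 'b7ad6b7169203331', ctx.sampled);
console.log(outgoing);
```

The trace ID stays constant across every hop, which is what lets a backend stitch the browser span and the service spans into one trace.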

Beacon API for reliable data submission

A regular fetch() that is still in flight when the page unloads is usually cancelled: the browser drops the request as the page closes (unless it was started with keepalive: true). navigator.sendBeacon() hands the request to the browser, which sends it in the background even if the user navigates away. Use it for analytics, RUM metrics, and error reports.

Limit: about 64KB of queued payload in most browsers; sendBeacon() returns false when the data cannot be queued. For larger data, batch events and flush periodically during the session, not only on unload.
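The batching advice above can be sketched as a small queue that flushes before a batch grows past a byte budget and again when the page is hidden. BeaconBatcher and installBrowserBatcher are hypothetical names, and the 60KB budget is an assumed safety margin under the roughly 64KB limit.

```javascript
// Assumed budget: stay comfortably under the ~64KB beacon limit.
const MAX_BATCH_BYTES = 60 * 1024;

class BeaconBatcher {
  // `send` receives one JSON string per batch. In a browser it would wrap
  // navigator.sendBeacon; injecting it keeps the class testable elsewhere.
  constructor(send, maxBytes = MAX_BATCH_BYTES) {
    this.send = send;
    this.maxBytes = maxBytes;
    this.events = [];
    this.bytes = 2; // the "[]" of the JSON array
  }

  add(event) {
    // UTF-8 size of the event plus one byte for the separating comma.
    const size = new TextEncoder().encode(JSON.stringify(event)).length + 1;
    if (this.events.length > 0 && this.bytes + size > this.maxBytes) {
      this.flush(); // adding this event would exceed the budget
    }
    this.events.push(event);
    this.bytes += size;
  }

  flush() {
    if (this.events.length === 0) return;
    this.send(JSON.stringify(this.events));
    this.events = [];
    this.bytes = 2;
  }
}

// Browser wiring: flush when the page is hidden, which also covers tab
// switches and mobile app switches, not only unload.
function installBrowserBatcher(endpoint) {
  const batcher = new BeaconBatcher((body) => navigator.sendBeacon(endpoint, body));
  document.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') batcher.flush();
  });
  return batcher;
}
```

In a page, something like `const batcher = installBrowserBatcher('/analytics')` followed by `batcher.add(...)` from each metric callback would replace one beacon per metric with one beacon per batch.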

🎤

Interview Questions

— Real questions from real interviews — with answers.

Monitoring alerts on known failure modes; observability lets you explore arbitrary system state from emitted data.

Minification shortens identifiers and collapses files into one line; source maps map bundle positions back to original file, line, and function names.

P50 hides the experience of users with slow devices or networks; P75 surfaces more of the slow tail that affects real users.

The browser generates a trace ID and sends it as a traceparent header; each service adds a span and forwards the header downstream.

An SLO sets a reliability target; the error budget is the unreliability that target allows (100% minus the SLO). Burning the budget quickly is a signal to slow down risky releases.

fetch() is usually cancelled when the page closes; sendBeacon() queues the request so the browser can send it after the page unloads.
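The SLO and error-budget answer above comes down to simple arithmetic. The sketch below expresses the budget as a number of failed requests rather than downtime, which is an assumption that tends to suit frontend error-rate SLOs; the function names and figures are illustrative.

```javascript
// slo is a fraction such as 0.999; totalRequests is the traffic observed in
// the SLO window. Returns the number of requests that are allowed to fail.
function errorBudget(slo, totalRequests) {
  return Math.round((1 - slo) * totalRequests);
}

// Fraction of the budget still unspent: 1 = untouched, 0 = fully spent,
// negative = overspent.
function budgetRemaining(slo, totalRequests, failedRequests) {
  const allowed = errorBudget(slo, totalRequests);
  if (allowed === 0) return failedRequests === 0 ? 1 : 0;
  return 1 - failedRequests / allowed;
}

// A 99.9% SLO over 1,000,000 requests allows 1,000 failures.
console.log(errorBudget(0.999, 1000000));          // 1000
// 250 failures so far leaves 75% of the budget.
console.log(budgetRemaining(0.999, 1000000, 250)); // 0.75
```

A team might gate risky releases on the remaining fraction, for example pausing feature launches once it drops below an agreed level.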

🎮

Memory Game

— Quick quiz — lock the concept in long-term memory.


What is 'session replay privacy' and what technical controls implement it?