Frontend Observability & RUM
~3-minute bite · solve the sandbox to master
5-Year-Old Metaphor
The physical, real-world picture. No jargon.
Observability = instruments in a cockpit. Without them, you're flying blind.
The cockpit analogy
Real users are flying your plane (app). Without instruments, you don't know altitude (LCP), speed (INP), or whether the engine is on fire (error rate). You only find out when the plane crashes (users churn, support tickets spike).
RUM = altitude/speed gauges: continuous readings from real flight conditions. Error tracking = engine warning lights. Session recording = the flight data recorder. Distributed tracing = the full flight path. Alerting = the alarm that wakes the pilot.
Monitoring vs Observability
Monitoring
Collects predefined metrics. Alerts when thresholds are breached. Tells you something is wrong. Answers: "Is it broken?"
Observability
Explores system state from outputs (logs, traces, metrics). Handles arbitrary questions. Tells you why. Answers: "What broke and where?"
Interactive Sandbox
Move something, see it react instantly.
Pattern: Collect Core Web Vitals from real users via web-vitals.js
Challenge
Visit all 5 observability patterns. Understand RUM, error tracking, session recording, tracing, and alerting.
Why Should I Care?
The exact interview question + the bug it kills.
Interview questions
Q: What is the difference between monitoring and observability?
Monitoring: you pre-define what to watch (error rate, LCP P75) and alert when thresholds breach. Good for known failure modes. Observability: your system emits enough data (logs, traces, metrics) that you can ask arbitrary questions after something breaks, even questions you didn't anticipate. Observability handles unknown unknowns. Modern teams need both.
Q: Why are source maps required for meaningful error tracking?
Production bundles are minified: local variable and function names are shortened to a letter or two (a, b, c), whitespace is stripped, and multiple files are concatenated. An error at bundle.js:1:58291 is meaningless without source maps, which map minified positions back to original file names, function names, and line numbers. Upload source maps to Sentry/Datadog on each deploy, and do not serve them publicly, because they expose your original source code.
Q: What is a P75 metric and why not P50?
P50 (median) means half of users are faster and half are slower, so the median can look fine even when nearly half of users have a bad experience. P75 means 75% of users are at or faster than this value, so it reflects users on slower devices and networks. Google uses P75 for Core Web Vitals because it represents the experience of users who aren't in ideal conditions.
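To make the P50 versus P75 gap concrete, here is a minimal sketch that computes both with the nearest-rank method. The `percentile` helper and the sample values are illustrative assumptions, not real measurements, and RUM vendors may use a different interpolation method.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return undefined;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical LCP samples in milliseconds: five fast loads, three slow ones.
const lcpSamples = [900, 1000, 1100, 1200, 1300, 3800, 4200, 5100];

const p50 = percentile(lcpSamples, 50); // 1200
const p75 = percentile(lcpSamples, 75); // 3800
```

With these samples the median sits comfortably under 2.5s, while P75 shows that a quarter of page loads took 3.8s or longer.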
Collect CWV in production
```javascript
import { onLCP, onINP, onCLS, onFCP, onTTFB } from 'web-vitals';

function sendToAnalytics({ name, value, rating }) {
  // Use sendBeacon so the data is still delivered on page unload
  navigator.sendBeacon('/analytics', JSON.stringify({
    metric: name,
    // CLS is a small unitless score (e.g. 0.08), so rounding would zero it out;
    // the other metrics are in milliseconds and round safely
    value: name === 'CLS' ? value : Math.round(value),
    rating, // 'good' | 'needs-improvement' | 'poor'
    url: location.href,
    // Crude heuristic: core count does not reliably separate phones from desktops
    deviceType: navigator.hardwareConcurrency > 4 ? 'desktop' : 'mobile',
  }));
}

// Each callback fires when its metric value is ready to report
onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);
onFCP(sendToAnalytics);
onTTFB(sendToAnalytics);
```
The Deep Dive
Spec refs, engine internals, the minutiae.
Core Web Vitals: 2024 thresholds
LCP, INP, and CLS are the Core Web Vitals; FCP and TTFB are supplementary diagnostic metrics.
| Metric | Good | Needs Work | Poor |
|---|---|---|---|
| LCP (Largest Contentful Paint) | ≤ 2.5s | 2.5–4s | > 4s |
| INP (Interaction to Next Paint) | ≤ 200ms | 200–500ms | > 500ms |
| CLS (Cumulative Layout Shift) | ≤ 0.1 | 0.1–0.25 | > 0.25 |
| FCP (First Contentful Paint) | ≤ 1.8s | 1.8–3s | > 3s |
| TTFB (Time to First Byte) | ≤ 800ms | 800ms–1.8s | > 1.8s |
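The web-vitals library already attaches a `rating` to each individual measurement, but aggregated values such as a P75 computed on the server need the same classification. A minimal sketch, assuming timing metrics are stored in milliseconds; `THRESHOLDS` and `rateMetric` are hypothetical names, and the numbers are copied from the table above.

```javascript
// Thresholds from the table: [upper bound for "good", upper bound for "needs improvement"].
// Timing metrics are in milliseconds; CLS is a unitless score.
const THRESHOLDS = {
  LCP: [2500, 4000],
  INP: [200, 500],
  CLS: [0.1, 0.25],
  FCP: [1800, 3000],
  TTFB: [800, 1800],
};

function rateMetric(name, value) {
  const bounds = THRESHOLDS[name];
  if (!bounds) throw new Error(`Unknown metric: ${name}`);
  const [good, needsImprovement] = bounds;
  if (value <= good) return 'good';
  if (value <= needsImprovement) return 'needs-improvement';
  return 'poor';
}
```

For example, `rateMetric('LCP', 3800)` returns `'needs-improvement'`, and `rateMetric('CLS', 0.3)` returns `'poor'`.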
OpenTelemetry for frontend
```javascript
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// SimpleSpanProcessor exports each span as soon as it ends.
// Note: SDK 2.x removed addSpanProcessor(); there, pass
// { spanProcessors: [...] } to the WebTracerProvider constructor instead.
const provider = new WebTracerProvider();
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new OTLPTraceExporter({ url: '/v1/traces' })
  )
);
provider.register();

// Auto-instrument: every same-origin fetch() now sends a traceparent header
// and creates a span in the trace. Cross-origin URLs only get the header
// when matched by the propagateTraceHeaderCorsUrls option.
registerInstrumentations({
  instrumentations: [new FetchInstrumentation()],
});
```
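The traceparent header mentioned in the code comment follows the W3C Trace Context layout: `version-traceId-parentSpanId-flags`, all lowercase hex. The sketch below formats and parses that layout by hand to show what the instrumentation puts on the wire. `formatTraceparent` and `parseTraceparent` are hypothetical helpers, not part of OpenTelemetry, and the parser only handles the four-field layout used by version 00.

```javascript
// version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent({ traceId, spanId, sampled }) {
  const flags = sampled ? '01' : '00'; // lowest bit = "sampled"
  return `00-${traceId}-${spanId}-${flags}`;
}

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header);
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid per the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return null;
  }
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Example IDs taken from the W3C Trace Context specification.
const header = formatTraceparent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true,
});
// header === '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
```

A backend service that receives this header keeps the trace ID, creates its own span with the incoming span ID as parent, and forwards a new header carrying its own span ID.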
Beacon API for reliable data submission
A regular fetch() started during page unload may be cancelled, because the browser can abort in-flight requests when the page closes. navigator.sendBeacon() hands the request to the browser, which delivers it asynchronously even after the user navigates away. Use it for analytics, RUM metrics, and error reports.
Limit: about 64KB of queued payload; sendBeacon() returns false when the browser refuses to queue the data. For larger data, batch events and flush periodically during the session, not only on unload.
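One way to stay under the payload limit is to batch events and flush before the batch grows too large. The sketch below is an illustration under stated assumptions: `createBeaconQueue`, `installBeaconQueue`, and the 60 KB headroom figure are hypothetical choices, and the `send` function is injected so the queue can be exercised outside a browser.

```javascript
// Assumption: 60 KB leaves headroom under the roughly 64 KB beacon limit.
const MAX_BATCH_BYTES = 60 * 1024;
const encoder = new TextEncoder();

function createBeaconQueue(endpoint, send) {
  let items = [];      // JSON-serialized events waiting to be sent
  let queuedBytes = 0; // UTF-8 size of the queued items

  // Sends everything queued as one JSON array. Returns the result of send();
  // events are dropped, not retried, if send() reports failure.
  function flush() {
    if (items.length === 0) return true;
    const payload = `[${items.join(',')}]`;
    items = [];
    queuedBytes = 0;
    return send(endpoint, payload);
  }

  // Queues one event, flushing first if it would push the batch over the limit.
  // A single oversized event is still sent on its own and may be rejected.
  function enqueue(event) {
    const json = JSON.stringify(event);
    const size = encoder.encode(json).length + 1; // +1 for the separating comma
    if (items.length > 0 && queuedBytes + size > MAX_BATCH_BYTES) flush();
    items.push(json);
    queuedBytes += size;
  }

  return { enqueue, flush };
}

// Browser wiring: flush whenever the page is hidden, which also covers
// tab switches and mobile app backgrounding, not only unload.
function installBeaconQueue(endpoint) {
  const queue = createBeaconQueue(endpoint, (url, body) =>
    navigator.sendBeacon(url, body)
  );
  document.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') queue.flush();
  });
  return queue;
}
```

In a page, `const queue = installBeaconQueue('/analytics')` followed by `queue.enqueue({ metric: 'LCP', value: 1200 })` would collect events during the session and send them in batches.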
Interview Questions
Real questions from real interviews, with answers.
Monitoring alerts on known failure modes; observability lets you explore arbitrary system state from emitted data.
Minification shortens symbol names to a letter or two; source maps map bundle positions back to original file/line/function names.
P50 hides the experience of users with slow devices or networks; P75 reflects more of the slower tail that real users hit.
The browser generates a trace ID and sends it as a traceparent header; each service adds a span and forwards the header downstream.
An SLO sets a reliability target; the error budget is the unreliability that target allows (1 minus the SLO). Burning through it quickly is the signal to slow down risky releases.
A plain fetch() can be cancelled when the page closes; sendBeacon() queues the request so the browser still sends it after the page unloads.
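The SLO and error-budget answer above comes down to simple arithmetic. A minimal sketch, assuming an availability SLO measured over a rolling window; the function names and the 99.9% over 30 days example are illustrative choices, not values from any particular team.

```javascript
// Error budget: the amount of "bad" time the SLO allows within the window.
function errorBudgetMinutes(sloTarget, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Burn rate: how fast the budget is being consumed relative to plan.
// 1 means the budget lasts exactly the window; higher means it runs out early.
function burnRate(observedErrorRate, sloTarget) {
  return observedErrorRate / (1 - sloTarget);
}

const budget = errorBudgetMinutes(0.999, 30); // about 43.2 minutes
const rate = burnRate(0.005, 0.999);          // about 5
```

At a burn rate of about 5, a 30-day budget is exhausted in roughly 6 days, which is the kind of signal that justifies pausing risky releases.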
Memory Game
Quick quiz: lock the concept in long-term memory.
What is 'session replay privacy' and what technical controls implement it?