Real-time SMS Notification Platform

Rebuilt Twilio's internal notification delivery pipeline from a polling-based architecture to an event-driven Kafka system — lifting the delivery SLA from 94% to 99.7% and slashing median latency from 4.2 seconds to 340ms.

Twilio | Software Engineer | Communications | 6 months

Situation

Twilio's internal notification system used a polling-based Rails worker that couldn't keep pace with 80M+ daily messages. Delivery delays were causing SLA breaches for enterprise customers, and the system had no capacity headroom for growth.

Task

Migrate from polling to an event-driven architecture while maintaining delivery guarantees and achieving a zero-downtime cutover of 80M+ daily messages.

Action

Designed a Kafka-based event pipeline with partitioned consumer groups per message priority tier. Built idempotent delivery handlers with exponential backoff and dead-letter queues. Ran both systems in parallel in shadow mode for 6 weeks before full cutover, validating delivery parity at each traffic milestone.
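
The handler pattern described above can be sketched roughly as follows. This is a minimal, in-memory illustration, not the production code: the names (Message, DeliveryHandler, TransientError, message_id), the retry limits, and the use of full-jitter backoff are all assumptions made for the example. A real deployment would presumably keep the deduplication set in a durable store and publish dead letters to a dedicated Kafka topic rather than a Python list.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Message:
    message_id: str  # hypothetical idempotency key carried by every event
    priority: str    # hypothetical priority tier name, e.g. "high"
    body: str


class TransientError(Exception):
    """A failure worth retrying, such as a downstream timeout."""


@dataclass
class DeliveryHandler:
    send: Callable[[Message], None]               # performs the actual delivery
    max_attempts: int = 5                         # assumed retry budget
    base_delay_s: float = 0.1                     # assumed first backoff ceiling
    max_delay_s: float = 5.0                      # assumed backoff cap
    sleep: Callable[[float], None] = time.sleep   # injectable for tests
    delivered_ids: Set[str] = field(default_factory=set)
    dead_letters: List[Message] = field(default_factory=list)

    def backoff_delay(self, attempt: int) -> float:
        """Exponential backoff with full jitter, capped at max_delay_s."""
        ceiling = min(self.max_delay_s, self.base_delay_s * (2 ** attempt))
        return random.uniform(0, ceiling)

    def handle(self, msg: Message) -> str:
        # Idempotency: a redelivered event must not trigger a second send.
        if msg.message_id in self.delivered_ids:
            return "duplicate"

        for attempt in range(self.max_attempts):
            try:
                self.send(msg)
            except TransientError:
                # Wait before the next attempt; skip the wait after the last one.
                if attempt + 1 < self.max_attempts:
                    self.sleep(self.backoff_delay(attempt))
                continue
            self.delivered_ids.add(msg.message_id)
            return "delivered"

        # Retries exhausted: park the message for inspection instead of dropping it.
        self.dead_letters.append(msg)
        return "dead-lettered"
```

Recording the message ID only after a successful send means a crash between the send and the record could still produce a duplicate on replay, which is why at-least-once consumers of this kind usually rely on a durable deduplication store or an idempotent downstream API.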

Result

Delivery SLA improved from 94% to 99.7%. Median delivery latency dropped from 4.2s to 340ms. System now has 3x headroom above peak observed load. Zero messages lost during migration.

Key Metrics

Delivery SLA: 94% → 99.7%

Median Latency: 4200ms → 340ms

Capacity Headroom: 3x

Messages Lost in Migration: 0