
Stripe's retry logic for failed payments was a monolithic system that caused cascading failures under peak load. I led its redesign into an event-driven retry service, reducing failed-payment recovery time by 60% and eliminating 3 recurring production incidents per month.
Stripe's payment retry system handled 2M+ daily retry attempts on an aging, synchronous Rails architecture. During a peak traffic event in Q4 2022, cascading timeouts caused a 45-minute degradation that affected 12,000 merchants and triggered a SEV-1 incident.
Design and implement a resilient, horizontally scalable retry service capable of handling 10x traffic spikes without degradation, while maintaining full backward compatibility with existing merchant integrations.
Led a 6-person team through a 3-phase migration: (1) extracted retry logic into an isolated Rails service, (2) introduced Kafka-based event queuing with per-merchant partition keys, (3) implemented exponential backoff with jitter and circuit breakers. Wrote comprehensive runbooks and led 2 incident response drills before go-live.
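The third phase combines two standard resilience techniques, and a minimal sketch may help show how they fit together. The code below is illustrative only, not the production implementation: the "full jitter" variant of backoff, the function and class names (`backoff_delay`, `CircuitBreaker`, `retry_payment`), and all thresholds and defaults are assumptions chosen for the example.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=60.0):
    """Full-jitter exponential backoff (assumed variant).

    Returns a random delay in [0, min(cap, base * 2**attempt)] seconds,
    so simultaneous retries spread out instead of arriving in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


class CircuitBreaker:
    """Opens after `threshold` consecutive failures (hypothetical defaults).

    While open, calls are rejected until `cooldown` seconds have passed,
    after which a trial call is allowed through (half-open behavior).
    """

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Permit a trial call once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            # Open the circuit, or re-open it after a failed trial call.
            self.opened_at = self.clock()


def retry_payment(charge, breaker, max_attempts=5, sleep=time.sleep):
    """Call `charge` until it succeeds, attempts run out, or the breaker opens."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            return False  # fail fast instead of piling onto a struggling dependency
        try:
            charge()
        except Exception:
            breaker.record_failure()
            sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return True
    return False
```

In this sketch the jitter keeps retries from many merchants from synchronizing after an outage, while the breaker stops retries entirely once a dependency is clearly failing, which is the kind of cascading-timeout scenario described above.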
Rolled out to 100% of traffic over 8 months with zero downtime. Failed-payment recovery time fell by 60%. Retry-related incidents dropped to zero. Merchant revenue recovery improved by an estimated $4.2M per month.
P99 Retry Latency: 2,400 ms → 180 ms
Monthly SEV Incidents: 3 → 0
Payment Recovery Rate: 94% → 99.1%
Monthly Revenue Recovered: +$4.2M
“Alex's leadership on the payment retry project was exceptional. The new architecture has been rock-solid since launch and the team learned a huge amount from how he structured the migration.”
— Priya Sharma, Engineering Manager · Stripe