Stripe · Fintech

Rebuilding Payment Retry Logic at Scale

Stripe's retry logic for failed payments was a monolithic system that caused cascading failures under peak load. I led its redesign into an event-driven retry service, reducing failed payment recovery time by 60% and eliminating 3 recurring production incidents per month.

Stripe · Lead Engineer · Fintech · 8 months

Situation

Stripe's payment retry system handled 2M+ daily retry attempts on an aging, synchronous Rails architecture. During a peak traffic event in Q4 2022, cascading timeouts caused a 45-minute degradation that affected 12,000 merchants and triggered an SEV-1 incident.

Task

Design and implement a resilient, horizontally scalable retry service capable of handling 10x traffic spikes without degradation, while maintaining full backward compatibility with existing merchant integrations.

Action

Led a 6-person team through a 3-phase migration: (1) extracted retry logic into an isolated Rails service, (2) introduced Kafka-based event queuing with per-merchant partition keys, (3) implemented exponential backoff with jitter and circuit breakers. Wrote comprehensive runbooks and led 2 incident response drills before go-live.
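
The ideas behind phases 2 and 3 can be sketched in a few lines of Python. This is an illustration only, not the production code: the names `partition_for`, `backoff_delay`, and `CircuitBreaker`, the SHA-256 stand-in hash, and every default value (0.5 s base, 60 s cap, 5 failures, 30 s reset) are assumptions chosen for the example.

```python
import hashlib
import random
import time


def partition_for(merchant_id, num_partitions):
    """Map a merchant to a stable partition so its retries stay ordered.

    SHA-256 is a stand-in hash for illustration; a real Kafka producer
    applies its own partitioner to the message key.
    """
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def backoff_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Exponential backoff with full jitter.

    Returns a delay in seconds drawn uniformly from
    [0, min(cap, base * 2**attempt)], so retries spread out over time
    instead of arriving together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures.

    Once reset_timeout has elapsed, calls are allowed again (a simple
    half-open state that does not limit the number of probes). A success
    closes the breaker; a failure re-opens it and restarts the timer.
    """

    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow(self):
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

In a retry worker built along these lines, a consumer would check `allow()` before calling the downstream payment processor, record the outcome, and on failure reschedule the attempt after `backoff_delay(attempt)` seconds. Keying messages by merchant keeps one merchant's retries in order on a single partition, and the jitter prevents synchronized retry bursts after an outage.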

Result

Deployed to 100% traffic over 8 months with zero downtime. Failed payment recovery time fell by 60%. Retry-related incidents dropped to zero. Merchant revenue recovery improved by an estimated $4.2M per month.

Key Metrics

P99 Retry Latency

2,400 ms → 180 ms

Monthly SEV Incidents

3 → 0

Payment Recovery Rate

94% → 99.1%

Monthly Revenue Recovered

+$4.2M

Testimonial

“Alex's leadership on the payment retry project was exceptional. The new architecture has been rock-solid since launch and the team learned a huge amount from how he structured the migration.”

— Priya Sharma, Engineering Manager · Stripe