
September 2023 - January 2024

Cloud Recomposition

From fragile and costly to resilient and high-performing, cloud architecture built for rapid growth.

-62%
cost savings on cloud operations
99.08%
System uptime since crash
3x growth capacity
ready for demand spikes

Overview

In September 2023, a sudden surge in user activity pushed our platform to its limits and exposed long-standing weaknesses in our cloud setup.

The infrastructure powers a social e-commerce platform where users interact as they would on social media while also browsing, ordering, and managing physical products through an integrated checkout and fulfillment flow.

Performance bottlenecks, unpredictable scaling, and rising cloud costs quickly became major challenges. As traffic increased, the platform struggled to handle both real-time activity and transactional workloads such as orders, payments, and inventory updates.

Over time, the infrastructure had also grown bloated, driving expenses higher than necessary. This project set out to re-architect the system to make it leaner, more scalable, and resilient, ensuring it could reliably support both engagement and commerce in the next phase of growth.

The Challenge

As the platform began to grow, the existing cloud setup stopped supporting the business and started holding it back. What once worked in the early stages became expensive, fragile, and unable to handle real demand.

Cost Inefficiency

Our AWS bill was already enormous, yet the system still wasn't good enough. Overprovisioned resources and poor optimization meant we were paying more every month without gaining reliability or performance. Costs kept rising, but business value didn't.

Scalability Limitations

When traffic suddenly increased, the system didn't just slow down; it began to collapse. Users couldn't interact with the platform, and core features stopped working. Moments that should have driven engagement and sales instead caused disruption, putting both customer experience and revenue at risk. At its worst, the entire business model was under threat.

Reliability Gaps

Weak fault tolerance and limited visibility meant small issues could quickly turn into major outages. The team was forced into reactive firefighting, while users experienced downtime that directly affected trust in the product and confidence in the business.

Technologies Used

ECS, EC2, MongoDB, CloudWatch, Datadog, S3, Docker, Lambda, WAF, Redis/ElastiCache

Approach

September 2023
Initial Assessment
Ran an emergency audit of the existing infrastructure to uncover the performance bottlenecks, cost inefficiencies, and scalability gaps exposed during the traffic surge.
October 2023
Initial Optimization
Optimized EC2 configurations and scaling policies, while collaborating with the backend team to fix slow MongoDB queries that had been dragging down performance.
November 2023
Architecture Redesign & Scaling Adjustments
Retained ECS as the core but overhauled its auto-scaling strategy, ensuring resources scaled smoothly with demand instead of failing under spikes.
December 2023
Cost Optimization & OPEX Reduction
Conducted a service-by-service audit, reconfiguring overprovisioned resources, optimizing S3 and CloudFront usage, and rightsizing EC2 instances, with the goal of cutting costs without sacrificing performance.
January 2024
Monitoring & Observability with CloudWatch
Enhanced monitoring with CloudWatch dashboards, alerts, and metrics, giving the team real-time visibility into system health and enabling proactive incident response.
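As one illustration of the alerting described in the January step, the sketch below builds the parameters for a CloudWatch alarm on ECS service CPU utilization. The cluster name, service name, threshold, and evaluation window are hypothetical values chosen for the example, not the project's actual settings. The dictionary is shaped for boto3's `put_metric_alarm`, which is deliberately not called here.

```python
def ecs_cpu_alarm_params(cluster, service, threshold_pct=80.0,
                         period_s=60, evaluation_periods=5,
                         datapoints_to_alarm=3):
    """Build keyword arguments for a CloudWatch alarm on ECS service CPU.

    All default values are illustrative assumptions.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError(
            "datapoints_to_alarm must be between 1 and evaluation_periods"
        )
    return {
        "AlarmName": f"{cluster}-{service}-cpu-high",
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": period_s,                      # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,  # window of N datapoints
        "DatapointsToAlarm": datapoints_to_alarm, # M of N must breach
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        # Treat a silent service as a problem rather than as healthy.
        "TreatMissingData": "breaching",
    }


# Hypothetical cluster and service names.
params = ecs_cpu_alarm_params("shop-cluster", "checkout-api")
```

With boto3 installed and credentials configured, `boto3.client("cloudwatch").put_metric_alarm(**params)` would create the alarm. Requiring 3 breaching datapoints out of 5 is one way to avoid alerting on a single noisy minute.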

Key Implementations

ECS Optimization & Auto-Scaling Adjustments

Fine-tuned ECS auto-scaling to respond dynamically to real-time traffic. The system now scales up smoothly during surges and contracts during off-peak hours, improving performance while cutting unnecessary costs.
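To make the scaling behavior concrete, here is a simplified model of how a target-tracking policy picks a task count: it scales the current count in proportion to how far the observed metric sits from its target, then clamps the result to configured bounds. This is a sketch of the general idea, not AWS's exact algorithm, and the 60% CPU target and the task limits are assumed values rather than the project's configuration.

```python
import math


def desired_task_count(current_tasks, observed_cpu_pct,
                       target_cpu_pct=60.0, min_tasks=2, max_tasks=30):
    """Estimate the task count that would bring average CPU back to target.

    Simplified proportional model; thresholds and bounds are assumptions.
    """
    if current_tasks < 1:
        raise ValueError("current_tasks must be at least 1")
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    # Total load is roughly current_tasks * observed CPU; spread it so each
    # task lands at the target utilization, rounding up to stay safe.
    raw = current_tasks * observed_cpu_pct / target_cpu_pct
    return max(min_tasks, min(max_tasks, math.ceil(raw)))


surge = desired_task_count(4, 90.0)     # traffic spike on 4 tasks
off_peak = desired_task_count(10, 12.0)  # quiet period on 10 tasks
capped = desired_task_count(20, 95.0)    # demand beyond the configured max
```

In this model the surge case grows from 4 to 6 tasks, the off-peak case contracts from 10 to 2, and the last case is capped at the 30-task maximum.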

MongoDB Performance Tuning

Identified and flagged inefficient queries that were slowing down the platform. Partnered with the backend team to optimize them, reducing latency and boosting overall database efficiency.
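One way to flag inefficient queries is to scan entries from MongoDB's database profiler for slow operations, full collection scans, and queries that examine far more documents than they return. The sketch below works on plain dictionaries that use the profiler's field names (`ns`, `millis`, `planSummary`, `docsExamined`, `nreturned`). The thresholds and sample entries are hypothetical, and this is not necessarily how the project identified its slow queries.

```python
def flag_slow_queries(profile_entries, slow_ms=100, scan_ratio=100):
    """Return profiler entries worth reviewing, slowest first.

    slow_ms and scan_ratio are illustrative thresholds.
    """
    flagged = []
    for entry in profile_entries:
        reasons = []
        millis = entry.get("millis", 0)
        if millis >= slow_ms:
            reasons.append("slow")
        if "COLLSCAN" in entry.get("planSummary", ""):
            reasons.append("collection scan")
        # Avoid dividing by zero when a query returns nothing.
        returned = max(entry.get("nreturned", 0), 1)
        if entry.get("docsExamined", 0) / returned >= scan_ratio:
            reasons.append("examines many more documents than it returns")
        if reasons:
            flagged.append(
                {"ns": entry.get("ns"), "millis": millis, "reasons": reasons}
            )
    return sorted(flagged, key=lambda f: f["millis"], reverse=True)


# Hypothetical profiler output for illustration only.
sample = [
    {"ns": "shop.orders", "millis": 840, "planSummary": "COLLSCAN",
     "docsExamined": 500000, "nreturned": 20},
    {"ns": "shop.products", "millis": 3, "planSummary": "IXSCAN { sku: 1 }",
     "docsExamined": 1, "nreturned": 1},
    {"ns": "shop.feed", "millis": 150, "planSummary": "IXSCAN { userId: 1 }",
     "docsExamined": 40, "nreturned": 40},
]
report = flag_slow_queries(sample)
```

Here the `shop.orders` query is flagged for all three reasons and would typically be a candidate for an index, while `shop.feed` is flagged only for its duration.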

Cost Optimization

Audited all active cloud services, eliminating overprovisioned resources and rightsizing EC2 instances. Optimized S3 storage and CloudFront delivery, achieving major cost savings without trade-offs in performance.
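A rightsizing pass can be sketched as a simple rule: step an instance down one size at a time while its projected peak CPU stays under a safety ceiling. The example assumes a CPU-bound workload where each step down halves the vCPU count and therefore roughly doubles utilization; memory, network, and burst behavior are ignored. The 80% ceiling and the utilization figures are hypothetical.

```python
# Sizes within one EC2 instance family, smallest to largest.
# Each step up doubles the vCPU count.
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge"]


def suggest_size(instance_type, peak_cpu_pct, max_peak_after=80.0):
    """Suggest the smallest size in the same family that keeps projected
    peak CPU at or below max_peak_after.

    Assumes CPU is the only constraint, which is a simplification.
    """
    family, size = instance_type.split(".")
    idx = SIZE_LADDER.index(size)  # raises ValueError for unknown sizes
    projected = peak_cpu_pct
    # Halving capacity roughly doubles utilization for the same load.
    while idx > 0 and projected * 2 <= max_peak_after:
        idx -= 1
        projected *= 2
    return f"{family}.{SIZE_LADDER[idx]}"


oversized = suggest_size("m5.2xlarge", 18.0)  # two steps down are safe
well_sized = suggest_size("m5.xlarge", 55.0)  # already appropriately sized
one_step = suggest_size("m5.4xlarge", 30.0)   # only one step down is safe
```

Under these assumptions the first instance drops to `m5.large`, the second stays at `m5.xlarge`, and the third moves to `m5.2xlarge`. A real audit would also weigh memory pressure and sustained load before changing instance types.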


Results & Impact

62% Lower Cloud Costs

Infrastructure spending was reduced by more than half. What had been a growing financial burden became a controlled, predictable cost, freeing budget for product development and business growth instead of overhead.

99.08% Uptime

The platform stayed available when it mattered most. Users could interact, browse, and place orders without interruptions, protecting customer trust and avoiding lost revenue during peak activity.
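To translate an uptime percentage into concrete terms, the short calculation below converts it into the total downtime it allows over a given period. The 30-day window is an assumption made for illustration; the measurement window behind the reported figure is not stated here.

```python
def allowed_downtime_hours(uptime_pct, days=30):
    """Hours of downtime implied by an uptime percentage over `days` days."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return (100.0 - uptime_pct) / 100.0 * days * 24


# Reported uptime figure, evaluated over an assumed 30-day window.
hours = round(allowed_downtime_hours(99.08), 2)
```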

3x Scalability

The system can now handle three times more users without breaking. Traffic surges that once caused failures are now absorbed smoothly, turning demand into growth instead of disruption.

Operational Agility

The team gained clear visibility into what was happening across the platform and could act before problems reached users. Issues are resolved faster, business disruption is reduced, and day-to-day operations run with far less friction.

Conclusion

This transformation turned a fragile and expensive setup into a stable, efficient, and growth-ready platform. What once limited the business now supports it, providing a foundation that is reliable, scalable, and financially sustainable.

More than a technical upgrade, it shows how focused improvements can change business outcomes. Smarter scaling, better visibility, and tighter cost control reduced risk, protected revenue, and made growth predictable instead of painful. The result is a platform built not just to run, but to grow with confidence.

Want to discuss this project?

I'm always happy to share more details about the cloud transformation or discuss how similar approaches could benefit your organization.
