Major projects driving infrastructure transformation
Built a comprehensive testing framework to replay production and synthetic traffic, leveraging the Grafana k6 test suite (wrapped as "k6pack") to push Kubernetes control-plane limits. Added AI-driven post-analysis: LangChain agents mine ForgeFire metrics along with MDM and AMF data, flag anomalies, and auto-generate performance reports. The framework has already uncovered GC tuning opportunities, saving cores across production API servers, and is now a reference for capacity-planning demos across the org.
Provides repeatable load patterns and automated insights, enabling data-backed tuning that cut control-plane CPU usage and established a blueprint for future scale tests.
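To make the post-analysis step concrete, here is a minimal Python sketch of the anomaly-flagging idea: z-scoring a candidate run's control-plane metrics against a baseline run. The metric names, sample values, and threshold are illustrative only, and this stands in for (rather than reproduces) the LangChain agent pipeline.

```python
"""Simplified sketch of the post-run anomaly-flagging step.

The real framework feeds ForgeFire, MDM, and AMF data into LangChain agents;
this stand-in just z-scores a few illustrative control-plane metrics from a
candidate run against a baseline run and prints anything that drifted.
"""
from statistics import mean, stdev

# Hypothetical per-run samples: metric name -> values collected during the load test.
baseline = {
    "apiserver_cpu_cores": [3.1, 3.0, 3.2, 3.1],
    "etcd_commit_ms_p99": [12.0, 11.5, 12.4, 12.1],
}
candidate = {
    "apiserver_cpu_cores": [4.4, 4.6, 4.5, 4.7],
    "etcd_commit_ms_p99": [12.2, 12.0, 12.5, 12.3],
}

def flag_anomalies(baseline, candidate, z_threshold=3.0):
    """Return metrics whose candidate mean drifts beyond z_threshold from the baseline."""
    findings = []
    for metric, base_samples in baseline.items():
        base_mu, base_sigma = mean(base_samples), stdev(base_samples)
        cand_mu = mean(candidate[metric])
        z = abs(cand_mu - base_mu) / max(base_sigma, 1e-9)  # guard against zero variance
        if z > z_threshold:
            findings.append((metric, base_mu, cand_mu, z))
    return findings

if __name__ == "__main__":
    for metric, base_mu, cand_mu, z in flag_anomalies(baseline, candidate):
        print(f"ANOMALY {metric}: baseline {base_mu:.2f} -> candidate {cand_mu:.2f} (z={z:.1f})")
```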
Championed and architected a multi-year transformation of a petabyte-scale Hadoop platform to cloud-native Kubernetes. Defined the vision, secured funding, and drove cross-team execution: standardising builds, introducing GitOps-based continuous delivery, and containerising multiple data services (HDFS and YARN on VMs; Spark, Trino, and Hive on Kubernetes).
Enabled self-service deployments for multiple engineering teams, cut release cycles from weeks to hours, and established the blueprint now used for every new data-platform cluster.
Led the effort to port a Hadoop ecosystem to the public cloud. Deployed legacy services on Azure VMs, introduced GitOps pipelines, and stood up the first fully functional Grid cluster in Azure, complete with data-parity validation against on-prem. Although the programme was later paused, it proved that large-scale Hadoop can run natively in the cloud and gave the team deep production experience with Azure and cloud-native infrastructure.
Demonstrated cloud viability for Hadoop, influenced the on-prem modernisation roadmap, and upskilled the team in cloud operations and observability.
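As a rough illustration of the data-parity validation mentioned above, the sketch below compares (size, checksum) listings from the on-prem and cloud copies and reports the differences. How the listings are produced (for example from `hdfs dfs -ls -R` output or copy-job reports) is left abstract, and all paths are hypothetical.

```python
"""Minimal sketch of a data-parity check between on-prem and cloud copies.

Assumes each side can produce a {path: (size, checksum)} listing; the listing
source and the checksum scheme are deliberately left abstract.
"""

def parity_report(onprem, cloud):
    """Compare two {path: (size, checksum)} maps and return the discrepancies."""
    missing = sorted(set(onprem) - set(cloud))
    extra = sorted(set(cloud) - set(onprem))
    mismatched = sorted(
        path for path in set(onprem) & set(cloud) if onprem[path] != cloud[path]
    )
    return {"missing_in_cloud": missing, "extra_in_cloud": extra, "mismatched": mismatched}

if __name__ == "__main__":
    # Hypothetical listings for illustration only.
    onprem = {
        "/data/events/2024-01-01.parquet": (1048576, "ab12"),
        "/data/events/2024-01-02.parquet": (2097152, "cd34"),
    }
    cloud = {
        "/data/events/2024-01-01.parquet": (1048576, "ab12"),
    }
    print(parity_report(onprem, cloud))
```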
Manually chasing hardware faults once consumed significant on-call time. Designed and led the rollout of an end-to-end lifecycle framework that detects server faults, files data-centre tickets, re-images hardware once repaired, and safely reintegrates nodes into the Hadoop cluster, all without human intervention.
The system now handles thousands of faults a year and has saved the team substantial engineering time, roughly 150+ hours per year since launch and growing as coverage expands.
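A simplified way to picture the lifecycle is as a small state machine: fault detected, ticket filed, repair confirmed, node re-imaged, node reintegrated. The sketch below models just that flow with stubbed integrations; the stage names and print statements are placeholders, not the real monitoring, ticketing, imaging, or cluster APIs.

```python
"""Sketch of the hardware-fault lifecycle as a simple state machine.

Every stage's integration (monitoring, data-centre ticketing, re-imaging,
cluster reintegration) is stubbed out; all names are illustrative only.
"""
from enum import Enum, auto

class NodeState(Enum):
    HEALTHY = auto()
    FAULT_DETECTED = auto()
    TICKET_FILED = auto()
    REPAIRED = auto()
    REIMAGED = auto()
    REINTEGRATED = auto()

def advance(node, state):
    """Move a faulty node one step through the lifecycle and return the next state."""
    if state is NodeState.FAULT_DETECTED:
        print(f"{node}: filing data-centre ticket")               # would call the ticketing system
        return NodeState.TICKET_FILED
    if state is NodeState.TICKET_FILED:
        print(f"{node}: repair confirmed by technician")          # would poll the ticket status
        return NodeState.REPAIRED
    if state is NodeState.REPAIRED:
        print(f"{node}: re-imaging and re-validating hardware")   # would trigger provisioning
        return NodeState.REIMAGED
    if state is NodeState.REIMAGED:
        print(f"{node}: rejoining Hadoop cluster and rebalancing")  # would call cluster admin APIs
        return NodeState.REINTEGRATED
    return state

if __name__ == "__main__":
    state = NodeState.FAULT_DETECTED
    while state is not NodeState.REINTEGRATED:
        state = advance("node-017", state)
```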
Authored the core tooling, Helm charts, and GitOps workflows that unlocked the move of LinkFS (distributed file service) and Trino (SQL-on-anything engine) from bespoke infrastructure to Kubernetes. Introduced DNS-based discovery with health checks to replace legacy eBGP VIPs, enabling true load balancing and horizontal scaling. Delivered production-ready proof-of-concept clusters that became the rollout blueprint for the full migration.
Validated the cloud-native approach and shaved months off the programme timeline; the patterns now power production clusters serving petabyte-scale analytics workloads.
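To illustrate the DNS-based discovery idea in miniature, the sketch below resolves a service DNS name to its current backend IPs (as a Kubernetes headless Service record would expose) and keeps only the backends whose health endpoint responds. The service name, port, and health path are assumptions, not the production configuration.

```python
"""Sketch of DNS-based discovery with health checking, replacing a fixed VIP.

Resolves a (hypothetical) headless-service DNS name to individual backend IPs,
probes each one's health endpoint, and returns the healthy targets a client or
load balancer could spread requests across.
"""
import socket
import urllib.request

def discover(service_dns, port):
    """Resolve the service name to the set of backend IPs currently registered in DNS."""
    infos = socket.getaddrinfo(service_dns, port, family=socket.AF_INET,
                               proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def healthy_targets(service_dns, port=8080, path="/health", timeout=2.0):
    """Keep only the backends whose health endpoint answers HTTP 200."""
    targets = []
    for ip in discover(service_dns, port):
        try:
            with urllib.request.urlopen(f"http://{ip}:{port}{path}", timeout=timeout) as resp:
                if resp.status == 200:
                    targets.append(ip)
        except OSError:
            pass  # unreachable or unhealthy backend is simply dropped from the set
    return targets

if __name__ == "__main__":
    # Illustrative name only; resolvable from inside the cluster, not from a laptop.
    try:
        print(healthy_targets("trino-worker.analytics.svc.cluster.local"))
    except socket.gaierror:
        print("service name not resolvable from here (expected outside the cluster)")
```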
Interested in collaborating on infrastructure challenges or discussing innovative solutions?