Featured Initiatives

Major projects driving infrastructure transformation

Kubernetes Infra Stress Testing

Building

Building a comprehensive testing framework that replays production and synthetic traffic, using the Grafana k6 test suite (wrapped as "k6pack") to push Kubernetes control-plane limits. Added AI-driven post-analysis: LangChain agents mine ForgeFire metrics, MDM and AMF data, flag anomalies, and auto-generate performance reports. The framework has already uncovered GC tuning opportunities that saved cores across production API servers, and it now serves as a reference for capacity-planning demos across the org.
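
To give a feel for the "flag anomalies" step, here is a minimal sketch. It is a plain statistical stand-in (a modified z-score over the median absolute deviation), not the LangChain agents described above, and the function name, threshold, and sample latencies are all hypothetical.

```python
from statistics import median


def flag_anomalies(samples, threshold=3.5):
    """Return indices of samples whose modified z-score exceeds `threshold`.

    Illustrative only: the real pipeline uses LangChain agents; this just
    shows the kind of outlier check such an analysis might start from.
    """
    if not samples:
        return []
    med = median(samples)
    # Median absolute deviation: robust to the very outliers we want to find.
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        # All points (or a majority) are identical; nothing to score against.
        return []
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [
        i for i, x in enumerate(samples)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical API-server latency samples (ms) from one load-test run.
latencies_ms = [12.0, 11.5, 12.3, 11.9, 95.0, 12.1, 12.4]
print(flag_anomalies(latencies_ms))  # index of the 95.0 ms spike
```

A median-based score is used here rather than mean and standard deviation because a single large spike inflates the standard deviation and can mask itself.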

Impact:

Provides repeatable load patterns and automated insight, enabling data-backed tuning that cut control-plane CPU usage and established a blueprint for future scale tests.

Kubernetes, Grafana k6, Go, LangChain, Prometheus

Modernizing Grid Infrastructure

Production

Championed and architected a multi-year transformation of a petabyte-scale Hadoop platform to cloud-native Kubernetes. Defined the vision, secured funding, and drove cross-team execution: standardising builds, introducing GitOps-based continuous delivery, and re-platforming multiple data services (HDFS and YARN on VMs; Spark, Trino, and Hive on Kubernetes).

Impact:

Enabled self-service deployments for multiple engineering teams, cut release cycles from weeks to hours, and established the blueprint now used for every new data-platform cluster.

Kubernetes, Helm, GitOps, Python, Hadoop

Hadoop Cloud Migration (Azure)

Cancelled

Led the effort to port a Hadoop ecosystem to the public cloud. Deployed legacy services on Azure VMs, introduced GitOps pipelines, and stood up the first fully-functional Grid cluster in Azure—complete with data-parity validation against on-prem. Although the program was later paused, it proved large-scale Hadoop can run natively in the cloud and gave the team deep production experience with Azure and cloud-native infrastructure.
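
As an illustration of what data-parity validation can look like, the sketch below compares two file manifests. It assumes each side can export a mapping of path to (size, checksum); the function name, manifest shape, and sample paths are hypothetical and not taken from the actual tooling.

```python
def parity_report(on_prem, cloud):
    """Compare two manifests of the form {path: (size_bytes, checksum)}.

    Hypothetical helper: reports paths missing in the cloud copy, paths that
    exist only in the cloud copy, and paths whose size or checksum differ.
    """
    shared = on_prem.keys() & cloud.keys()
    return {
        "missing": sorted(on_prem.keys() - cloud.keys()),
        "extra": sorted(cloud.keys() - on_prem.keys()),
        "mismatched": sorted(p for p in shared if on_prem[p] != cloud[p]),
    }


# Made-up manifests for illustration.
on_prem = {
    "/data/a.parquet": (1024, "aa11"),
    "/data/b.parquet": (2048, "bb22"),
    "/data/c.parquet": (4096, "cc33"),
}
cloud = {
    "/data/a.parquet": (1024, "aa11"),
    "/data/b.parquet": (2048, "ffff"),  # checksum differs
    "/data/d.parquet": (512, "dd44"),   # only in the cloud
}

print(parity_report(on_prem, cloud))
```

An empty report on all three keys would indicate parity for the compared paths.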

Impact:

Demonstrated cloud viability for Hadoop, influenced on-prem modernisation roadmap, and up-skilled the team in cloud operations and observability.

Azure, Kubernetes, Airflow, GitOps, Hadoop

Break-Fix Automation

Production

Chasing hardware faults by hand used to consume significant on-call time. I designed and led the roll-out of an end-to-end lifecycle framework that detects server faults, files data-centre tickets, re-images hardware once repaired, and safely reintegrates nodes into the Hadoop cluster, all without human intervention.
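
The lifecycle described above maps naturally onto a small state machine. The sketch below is one way to model it, not the production implementation; the state names, event names, and helper functions are all hypothetical.

```python
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    FAULT_DETECTED = "fault_detected"
    TICKET_FILED = "ticket_filed"
    REPAIRED = "repaired"
    REIMAGED = "reimaged"


# Allowed (state, event) -> next state transitions. Event names are
# illustrative assumptions, not the real system's vocabulary.
TRANSITIONS = {
    (NodeState.HEALTHY, "fault"): NodeState.FAULT_DETECTED,
    (NodeState.FAULT_DETECTED, "ticket_filed"): NodeState.TICKET_FILED,
    (NodeState.TICKET_FILED, "repair_confirmed"): NodeState.REPAIRED,
    (NodeState.REPAIRED, "reimage_done"): NodeState.REIMAGED,
    (NodeState.REIMAGED, "health_check_passed"): NodeState.HEALTHY,
}


def advance(state, event):
    """Return the next state, or raise ValueError for an out-of-order event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.name}"
        ) from None


def run(events, state=NodeState.HEALTHY):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = advance(state, event)
    return state


full_cycle = [
    "fault", "ticket_filed", "repair_confirmed",
    "reimage_done", "health_check_passed",
]
print(run(full_cycle))  # back to HEALTHY after a complete cycle
```

Rejecting out-of-order events is what makes unattended operation safe in a model like this: a node cannot rejoin the cluster unless it has passed through repair, re-image, and a health check.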

Impact:

The system now handles thousands of faults a year and has saved the team 150+ engineering hours per year since launch, a figure that keeps growing as coverage expands.

Python, Jenkins, Nagios, REST APIs, Hadoop

LinkFS & Trino Migration to Kubernetes

Production

Authored the core tooling, Helm charts, and GitOps workflows that unlocked the move of LinkFS (distributed file service) and Trino (SQL-on-Anything engine) from bespoke infrastructure to Kubernetes. Introduced DNS-discovery health checks to replace legacy eBGP VIPs—enabling true load-balancing and horizontal scaling. Delivered production-ready proof-of-concept clusters that became the rollout blueprint for the full migration.
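
For readers unfamiliar with the DNS-discovery pattern, here is a rough sketch of the idea: resolve every address behind one DNS name, probe each, and route only to the healthy ones. It assumes a name that returns one record per backend (for example a Kubernetes headless Service) and uses a bare TCP connect as the health check; the function names are hypothetical and the real checks may differ.

```python
import socket


def discover_backends(hostname, port, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPs that `hostname` resolves to."""
    infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def is_healthy(ip, port, timeout=1.0):
    """Assumed health check: the backend accepts a TCP connection."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_backends(hostname, port, resolver=socket.getaddrinfo,
                     probe=is_healthy):
    """Discover backends via DNS and keep only those passing the probe."""
    return [
        ip for ip in discover_backends(hostname, port, resolver)
        if probe(ip, port)
    ]
```

The `resolver` and `probe` parameters are injectable so the logic can be exercised without a network. Because every healthy backend is returned, clients can spread load across all of them and new replicas become reachable as soon as DNS lists them.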

Impact:

Validated cloud-native approach and shaved months off the programme timeline; the patterns now power production clusters serving petabyte-scale analytics workloads.

Kubernetes, Helm, GitOps, DNS Discovery, Trino, LinkFS

Let's Build Something Amazing

Interested in collaborating on infrastructure challenges or discussing innovative solutions?