11+ years building and scaling enterprise infrastructure from hardware to cloud, leading cross-functional teams and mentoring engineers through major technology transitions
Led 3 major infrastructure transitions • Mentored 15+ engineers, promoted 6 to senior roles
Managing distributed systems serving 1.5B+ LinkedIn members globally across multiple data centers with industry-leading reliability and performance
99.95% uptime for critical services • Global infrastructure serving 900M+ users
Managing hundreds of clusters with thousands of nodes and millions of containers, driving reliability engineering and performance optimization at hyperscale
10,000+ containers across regions • Control-plane CPU reduced 40% through optimization
Passionate about intelligent infrastructure automation using LangChain and AI agents for anomaly detection, automated operations, and next-generation SRE practices
1,500+ engineering hours saved annually • 80% reduction in manual operations
Taming clusters and pushing limits. (Kubernetes @ scale)
• Currently leading stress testing for Kubernetes infrastructure, building a framework of stress scenarios that simulate both real production traffic and synthetic workloads, using Grafana k6, KWOK, and Kubernetes itself to push systems to their limits
• Developing AI-driven performance analysis using LangChain agents to mine metrics, detect anomalies, and auto-generate performance reports for capacity planning
• Developed automated pipelines for cluster buildout, creating templating tools that streamline deployment configurations and enable rapid cluster provisioning
Herding yellow elephants. Fixing things, breaking things, and having fun! (Hadoop & Grid)
• Led a multi-year transformation of a petabyte-scale Hadoop platform (HDFS, YARN, Spark, Hive, etc.) from a legacy deployment stack built on configuration management and Jenkins to a modern stack built on Kubernetes and an internal deployment orchestration platform. This involved defining the vision, securing funding, and driving cross-team execution
• Served as SRE lead for Grid's Azure cloud migration initiative, modernizing 30k+ node Hadoop infrastructure for deployment on Azure VMs in the public cloud. Although the Azure project was later cancelled, it provided crucial cloud experience and directly influenced the subsequent on-premises modernization strategy
• Developed the CIR (Can I Reimage) system, saving 1,000+ engineering hours by automating infrastructure approval workflows
• Mentoring engineers and driving AI adoption initiatives, including running team vibe-coding sessions and building Glean-powered automation tools for operational efficiency, with Slack bots for easy access
Designed and implemented a centralized Break-Fix automation system for Hadoop which, after detecting hardware faults, files tickets with local data centers, waits for repair, and then reintegrates nodes into their respective clusters. Estimated to save ~1,500 engineering hours/year.
• Migrated entire Hadoop ecosystem from custom infrastructure stack to company-wide Common Operating Platform, retiring legacy systems like Cobbler and BCFG2 while maintaining operational agility
• Built an automated LDAP-to-ACL synchronization system, eliminating manual network rule management and delivering the company's first fully automated address group management
• Led enterprise security initiatives including Grid's distributed firewall rollout across all clusters and implementation of global ACL templates, significantly strengthening infrastructure security posture
• Designed and deployed Cluster Manager v2, a Swiss Army knife automation platform for end-to-end cluster lifecycle management, handling 100+ automated expansions and decommissions
• Spearheaded RHEL 7 migration across the entire Hadoop fleet, including complex kernel tuning and cgroups configuration for performance optimization
Apache, Hadoop, and yellow elephants. Managing the entire Hadoop infrastructure, from driving vendor decisions and ordering hardware to imaging servers and supporting services.
• Maintained Cobbler kickstart infrastructure for automated server provisioning and OS deployment across thousands of nodes
• Implemented Kerberos + LDAP authentication systems for secure Hadoop clusters and user access management
• Used BCFG2 for configuration management, ensuring consistent system configurations across the entire Grid environment
• Deployed and maintained Nagios monitoring systems for infrastructure health checks, alerting, and performance metrics
• Performed deep systems troubleshooting, diving into kernel-level issues and low-level hardware problems
I'm always excited to discuss complex technical challenges and opportunities to build amazing systems.