Alex Augsburger - Software Engineer & Adventurer

Technical Expertise

Technical Leadership & Experience

11+ years building and scaling enterprise infrastructure from hardware to cloud, leading cross-functional teams and mentoring engineers through major technology transitions

Key Technologies

Go • Python • Kubernetes • Technical Mentoring • Architecture Design

Led 3 major infrastructure transitions • Mentored 15+ engineers, promoted 6 to senior roles

Hyperscale Infrastructure & Reliability

Managing distributed systems serving 1.5B+ LinkedIn members globally across multiple data centers with industry-leading reliability and performance

Key Technologies

Distributed Systems • Load Balancing • Observability • SRE • Incident Response

99.95% uptime for critical services • Global infrastructure serving 900M+ users
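As a rough illustration of what a 99.95% availability target implies, the sketch below converts an availability percentage into an annual downtime budget. The function name and the 365-day year are assumptions made for this example, not details of any production tooling.

```python
# Hypothetical helper: convert an availability target into a downtime budget.
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_budget_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime allowed over the period at this availability."""
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    budget = downtime_budget_hours(99.95)
    print(f"Allowed downtime per year: {budget:.2f} hours")
```

At 99.95%, the budget works out to about 4.38 hours of downtime per year.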

Kubernetes at Enterprise Scale

Managing hundreds of clusters with thousands of nodes and millions of containers, driving reliability engineering and performance optimization at hyperscale

Key Technologies

Kubernetes • Performance Tuning • Grafana k6 • Helm • GitOps

10,000+ containers across regions • Control-plane CPU reduced 40% through optimization

AI-Powered Automation & Innovation

Passionate about intelligent infrastructure automation, using LangChain and AI agents for anomaly detection, automated operations, and next-generation SRE practices

Key Technologies

LangChain • AI/ML • Automation • Python • Infrastructure as Code

1,500+ engineering hours saved annually • 80% reduction in manual operations

Career Journey

Current Role

Sr. Staff Software Engineer

Apr 2025 - Present

San Francisco, CA

Taming clusters and pushing limits. (Kubernetes @ scale)

• Currently leading stress testing for Kubernetes infrastructure, building a framework of stress scenarios to simulate both real production traffic and synthetic workloads, leveraging Grafana k6, kwok, and k8s itself to push systems to their limits
• Developing AI-driven performance analysis using LangChain agents to mine metrics, detect anomalies, and auto-generate performance reports for capacity planning
• Developed automated pipelines for cluster buildout, creating templating tools that streamline deployment configurations and enable rapid cluster provisioning

Kubernetes • Go • Python • Grafana k6 • LangChain • AI/ML • Performance Tuning • Helm • GitOps • Automation

Staff Engineer

Mar 2020 - Apr 2025

Mountain View, CA

Herding yellow elephants. Fixing things, breaking things, and having fun! (Hadoop & Grid)

• Led a multi-year transformation of a petabyte-scale Hadoop platform (HDFS, YARN, Spark, Hive, etc.) from a legacy deployment stack built on configuration management and Jenkins to a modern stack built on Kubernetes and an internal deployment orchestration platform. This involved defining the vision, securing funding, and driving cross-team execution
• Served as SRE lead for Grid's Azure cloud migration initiative, modernizing 30k+ node Hadoop infrastructure for deployment on Azure VMs. Although the Azure project was later cancelled, it provided crucial cloud experience and directly influenced the subsequent on-premises modernization strategy
• Developed the CIR (Can I Reimage) system, saving 1,000+ engineering hours by automating infrastructure approval workflows
• Mentored engineers and drove AI adoption initiatives, including team vibe coding sessions, Glean-powered automation tools for operational efficiency, and Slack bots for easy access

Hadoop • Apache Spark • MapReduce • HDFS • Kubernetes • Azure • Docker • GitOps • CFEngine • LDAP • Security • Automation • AI/ML • CIR • Trino

Senior SRE, Hadoop

Mar 2017 - Mar 2020

Mountain View, CA

Designed and implemented a centralized Break-Fix automation system for Hadoop that detects hardware faults, files tickets with local data centers, waits for repair, and then re-integrates nodes into their respective clusters. Estimated to save ~1,500 engineering hours/year.
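The detect, ticket, wait, and re-integrate cycle described above can be pictured as a small state machine. The sketch below is purely illustrative: the state names, event names, and functions are assumptions made for this example and do not reflect the actual system's design or code.

```python
from enum import Enum, auto


class NodeState(Enum):
    """Hypothetical lifecycle states for a node in a break-fix workflow."""
    HEALTHY = auto()
    FAULT_DETECTED = auto()
    TICKET_FILED = auto()
    REPAIRED = auto()


# Allowed transitions: (current state, event) -> next state.
# Event names are illustrative placeholders.
TRANSITIONS = {
    (NodeState.HEALTHY, "hardware_fault"): NodeState.FAULT_DETECTED,
    (NodeState.FAULT_DETECTED, "ticket_filed"): NodeState.TICKET_FILED,
    (NodeState.TICKET_FILED, "repair_confirmed"): NodeState.REPAIRED,
    (NodeState.REPAIRED, "reintegrated"): NodeState.HEALTHY,
}


def advance(state, event):
    """Return the next state, or raise ValueError if the event is not valid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} not allowed in state {state.name}"
        ) from None


def run_lifecycle(events, state=NodeState.HEALTHY):
    """Apply events in order and return the list of states visited."""
    history = [state]
    for event in events:
        state = advance(state, event)
        history.append(state)
    return history


if __name__ == "__main__":
    cycle = ["hardware_fault", "ticket_filed", "repair_confirmed", "reintegrated"]
    for visited in run_lifecycle(cycle):
        print(visited.name)
```

Keeping the transitions in a single table means an out-of-order event (for example, re-integrating a node that has no confirmed repair) fails loudly instead of silently returning a bad node to its cluster.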

• Migrated entire Hadoop ecosystem from custom infrastructure stack to company-wide Common Operating Platform, retiring legacy systems like Cobbler and BCFG2 while maintaining operational agility
• Built automated LDAP-to-ACL synchronization system, eliminating manual network rule management and creating the first fully automated address group management at company scale
• Led enterprise security initiatives including Grid's distributed firewall rollout across all clusters and implementation of global ACL templates, significantly strengthening infrastructure security posture
• Designed and deployed Cluster Manager v2, a Swiss Army knife automation platform for end-to-end cluster lifecycle management, handling 100+ automated expansions and decommissions
• Spearheaded RHEL 7 migration across the entire Hadoop fleet, including complex kernel tuning and cgroups configuration for performance optimization

Python • Hadoop • Computer Hardware • Troubleshooting • CFEngine • LDAP • Security • RHEL • Automation

Engineer, Grid Systems

Aug 2015 - Mar 2017

Mountain View, CA

Apache, Hadoop, and yellow elephants. Managed the entire Hadoop infrastructure, from driving vendor decisions and ordering hardware to imaging servers and supporting services.

• Maintained Cobbler kickstart infrastructure for automated server provisioning and OS deployment across thousands of nodes
• Implemented Kerberos + LDAP authentication systems for secure Hadoop clusters and user access management
• Used BCFG2 for configuration management, ensuring consistent system configurations across the entire Grid environment
• Deployed and maintained Nagios monitoring systems for infrastructure health checks, alerting, and performance metrics
• Deep systems troubleshooting, diving into kernel-level issues and low-level hardware problems

BCFG2 • LDAP • Jenkins • Kerberos • Ganglia • Nagios • Cobbler • Red Hat

Interested in Working Together?

I'm always excited to discuss complex technical challenges and opportunities to build amazing systems.
