Architect next-generation on-prem infrastructure for online game backend services with a focus on performance, reliability, visibility and ease of adoption.
- Implemented game backend services in Docker containers for an initial proof-of-concept.
- Selected Docker Swarm for orchestration because it provided the cleanest migration path for a traditional LAMP stack.
- Automated ingress, load balancing, vhost routing and SSL management with Traefik.
- Services previously deployed with PXE, Fabric and Ansible were prone to configuration drift, risky upgrades and limited recovery. Using SwarmCD, containerized services and configurations are now tested, promoted and deployed via a simple GitOps workflow.
- Modernized aging code deployments that relied on a combination of Jenkins jobs and Fabric or Ansible scripts, replacing them with simple, performant CephFS shares that provide faster deployments and eliminate the consistency issues of per-host deployment.
- Replicated this environment easily onto stage, test and local development environments, greatly streamlining dev and QA processes.
- Improved visibility into backend services with per-game Grafana Alloy logging pipelines and accompanying Grafana dashboards, providing rich log search, faster error reporting and accurate anomaly detection while handling nearly 30,000 requests per second.
- Built out a bare-metal provisioning proof-of-concept using Tinkerbell deployed on a small Kubernetes cluster with an accompanying Terraform project. This lab successfully demonstrated physical hardware lifecycle management with the same tools and processes used in existing Amazon EKS projects.
Building modular custom build images, tools and workflows for deploying and configuring highly specific systems.
- Developed and implemented custom provisioning images and tools for vCMTS (virtual cable modem termination system) appliances. This work was critical to rolling out next-generation DOCSIS networks by ensuring these systems could be provisioned, tested and configured quickly and reliably without additional hardware or licensing overhead.
- Used Canonical MAAS for discovery and PXE tasks to bootstrap the imaging environment, Ansible and HashiCorp Packer to create the in-memory build images, and OSTree to apply the vCMTS testing and installation framework. The testing and configuration processes were implemented in Docker.
Supporting the CDN Team initiative to migrate manually managed Akamai rules to configuration-as-code.
- Supported the initiative to modernize and automate CDN environments.
- As part of the CDN (Akamai) Team, I was tasked with creating scripts and workflows to assist our team in supporting our internal customer needs.
- Worked alongside our principal architect during early planning stages to codify CDN configuration and bring existing and future environments into sync and under revision control.
- Rebuilt a suite of utilities used by dev/ops teams across the business to shape and validate web traffic across development and production environments.
- Built new CDN purge utilities as part of a mentoring project with a student intern. Over a series of paired working calls, we created and packaged a CLI tool and an associated AWS Lambda function that can be easily used by individuals or as part of automation processes. While this was a relatively simple script, we discussed and implemented common and recommended tools for revision control (Git), dependency management (Pyenv, Poetry), package management (PyPI/Homebrew), and process automation (Ansible). The student presented the project to our team's management chain as a successful goal for the internship.
Providing custom-tailored OpenStack and automation solutions for high-value clients.
- Worked on-site with a very high-value client who wished to validate OpenStack reliability on an upcoming generation of hardware and rethink their internal cloud design. Together we developed tools to automate containerized deployment (pre- Docker-OpenStack) and I guided the team in testing, deploying, managing, and monitoring their OpenStack cloud. My unique, intuitive, and maintainable design prompted immediate praise from the client, for which I earned the WWT Global Service Provider - Engineer of The Year Award.
- Developed an automated build pipeline using Ansible Tower and Foreman which, given minimal specifications, allows our technicians to deploy the resources necessary to provision and test hundreds of systems each day. These systems are shipped to our customers fully racked and cabled for immediate integration.
- Created an environment for testing and metrics collection for a prototype mesh storage solution. This used NVMe drives and remote direct disk access (over Ethernet) to provide ultra-low latency and ultra-high I/O operation rates.
- Worked with the Principal Ceph Storage Engineer to better understand the management, deployment, and performance tuning of Ceph clusters.
- Advised the first cross-divisional project team on building an automated solution to verify and remediate hardware configuration, health status, firmware, and operating system versions. Provided guidance on setting up a PXE boot environment for hardware and system configuration, automated firmware remediation, and reporting.
Guide the creation of new APIs, services and ETL tools to migrate application data from an obsolete backend with minimal customer impact.
- Responsible for the redesign of the infrastructure running the mobile application, revamping the entire backend into a hybrid platform on AWS, OpenStack, and KVM+Vagrant. My team was responsible for developing API services using Node.js, Express.js, Redis, Percona XtraDB Cluster, and Nginx to facilitate migration from Facebook's Parse Mobile App Platform.
- Built out monitoring and alerting systems with Icinga, ELK stack, and TIG (Telegraf, InfluxDB, Grafana) stack.
- Deployed Salt for configuration, control, self-healing, and rapid deployment/scale of systems and services in a hybrid cloud environment.
- Technologies used for infrastructure include Percona XtraDB Cluster, Salt, Icinga, Redis, Nginx, HAProxy, Node.js, Elasticsearch, Jenkins, KVM, Vagrant, Amazon Web Services (primarily EC2 and S3), LDAP, pfSense, OpenVPN.
Architect and develop tools for an on-prem multi-site OpenStack cluster, deployed on Cisco UCS and NetApp hardware. Oversee migration of bare-metal services into a virtualized IaaS environment.
- Worked with a small high-output team responsible for maintaining site reliability while also transitioning all significant services to an internal OpenStack cloud.
- Principal engineer of private cloud environment using OpenStack (Grizzly, Icehouse) on Ubuntu (12.04, 14.04 respectively) built on Cisco UCS B-Series blade hardware, and backed by NetApp clustered NFS storage.
- Shared responsibility for 22 petabytes of storage, spread over 72 NetApp appliances (FAS6040, FAS3240, FAS3210, FAS2240, FAS2020) running ONTAP 7.1.1 - 8.1.1 in 7-Mode and Clustered-Mode.
- Developed custom storage balancing tools required to move roughly 6 petabytes of data out of NetApp TradVols and into 64-bit FlexVols. Managed the rebalancing of data across the existing 32-bit FlexVols to begin the upgrade to 64-bit FlexVols, which was essential to the release and ramp-up of the site redesign given the data-ingest projections.
- Headed a storage initiative to compare flash-based Pure Storage against NetApp's latest flash-based storage as the backend for our newly virtualized databases. This project gave me hands-on experience with Fibre Channel switching and zoning configuration on Cisco UCS 6200 Fabric Interconnects, Linux FCoE and multipathing, and configuring KVM virtual machines to boot from SAN.
Provide large-scale architecture and automation expertise to modernize application stack and improve uptime and performance.
- Managed the deployment of the Yahoo! Contributor Network including network and hardware specifications, system provisioning, QA process, and launch.
- Supported our in-house development team in leveraging new Yahoo! technology, including its cloud computing platform, which now handles all Associated Content analytics and content (content agility) and feeds that content across Yahoo!'s entire collection of properties.
- Responsible for 100% availability of Associated Content and Yahoo! Contributor Network consisting of 340 RHEL5 servers.
- Managed the Yahoo! Voices launch and effected changes to improve scalability and automation.
- Architected parallel code deployment system for dev, stage, QA, and production environments using Git. This resulted in full deploys in under 90 seconds and featured tiered-deploy and full rollback-to-stable capabilities.
- Managed the Denver office integration with Yahoo!'s corporate network, completing the migration several months ahead of Yahoo! projections. This included building out the IDF room to spec for cooling and power, redundant network stacks, cabling, and buildout/migration of all development environments.
- Co-managed the Yahoo! Contributor Network site release in the US, UK, and Brazil.
- Worked closely with our development team to consistently improve site and application performance as our contributor base grew to over 500,000 and our publishing platform moved towards internationalization, planning for growth accordingly.
Leverage previous cPanel and WHM experience to improve product offerings. Mentor junior admins and support staff to improve ticket resolution times and customer support experience, leading to an increase in reported customer satisfaction from 75% to 90%.
- Responsible for maintaining the hardware and software of more than 250 CentOS servers in a high-availability environment.
- Through my redesign of the existing kickstart provisioning build system, we achieved nearly zero-touch provisioning of systems.
- Implemented a new, fully redundant, multi-gigabit network using 25-pair / 50-pin telco modules to eliminate rack-confined switches, greatly simplifying network architecture.
- Deployed a network of serial console servers with improved network boot options, reducing the need for hands-on software repairs by 60%.
- Deployed an in-house iSCSI storage solution on Supermicro chassis to migrate onboard storage of customer data to more robust and secure SANs.
- Responsible for fully training our systems administration team, with a focus on advanced troubleshooting techniques, system maintenance tooling and procedures, and methods for providing exceptional customer care.
Offering high-performance multi-player Game Servers bundled with voice chat and web hosting services tailored to a variety of customer requirements and budgets.
- Built, racked, cabled, and configured 50 Fedora Core 4 webservers.
- Maintained remote and on-site hardware, software, installation, and deployments of cPanel and WHM software.
- Executed deployments of ModernBill and then WHMCS billing and client areas.
- Configured DNS clustering with BIND.
- Managed our Apache webserver deployment. We were one of the first companies to offer developer-friendly PHP 5 web hosting, in addition to multiple PHP-CGI options now used by dozens of shared hosting vendors.
- Oversaw our upgrade to PHP 5, ensuring that the change went smoothly for customers, and managed the manual migration of over 500 websites (including email) from a non-cPanel environment to our systems.
- Deployed our internal network to handle monitoring, backups, and remote administration.
- Solely responsible for streamlining the customer sales and support processes to ensure efficiency and an excellent user experience.
Highly available compute cluster with tiered distributed storage dedicated to home automation, home security cameras, intercoms, media services, machine-learning, image recognition and R&D projects.
- The hardware infrastructure consists of 5 nodes with 240 CPU cores, 512GB RAM, Nvidia Tesla T4 GPUs (in GRID vGPU mode for VM workloads), 5x Nvidia RTX 3050 GPUs for containerized workloads and 1x RTX 5060 for machine-learning processes.
- QEMU/KVM virtual machines provide the secure swarm nodes and a handful of specialized services where vGPU passthrough is desired.
- Docker Swarm is used for container orchestration and is bootstrapped with Traefik (which provides ingress, load balancing, SSL termination and SSL certificate management), a private Docker registry, several caching registry proxies, and SwarmCD, which provides GitOps lifecycle management for containers.
- SwarmCD deploys all remaining services, enables CI/CD for all immutable configurations and manages SOPS-encrypted Docker secrets for sensitive data. These are automatically versioned and rolled out to containers on merge, allowing for zero-downtime updates.
- Persistent container volumes are provisioned from a 10TB NVMe Ceph cluster using a custom-built Docker volume plugin for RADOS block devices.
- A second 300TB SAS3 Ceph cluster provides high-capacity networked file storage via CephFS for media, backups and other bulk / read-heavy storage needs.
- The observability plane is implemented with clustered VictoriaMetrics, VictoriaLogs and VictoriaTraces, monitored with vmalert and Alertmanager and visualized in Grafana. Service metrics are automatically collected based on simple service tags, and all system and service logs are collected via Grafana Alloy. Previous metric and log collection and storage services have included Telegraf, InfluxDB (v1, v2 and v3), Grafana Loki, Mimir and Tempo, Vector and Prometheus.
- All service authorization and authentication is handled by Authentik using OAuth, OpenID or reverse-proxy auth via Traefik. Authentik pulls its user and group data from OpenLDAP (which also provides Samba auth, sudo permissions, automount maps and user SSH public keys).