1401 Flower Street
Job Category: DevOps
Job Number: 18947
The Senior Systems Reliability Engineer is expected to have expert-level systems administration skills on both the Linux and Windows server platforms, and must have extensive experience with web technologies, source management, cloud hosting, container computing, and the DevOps team culture. This position will also bring expertise in systems, operational excellence, application stability, security, performance, capacity management, and documentation.
This position works closely with various Animation Tools and Ride Engineering teams across the company to brainstorm, architect, gather requirements, troubleshoot, and provide stellar systems and automation. The role requires someone who is creative, proactive, constructive, and highly motivated. The Senior SRE must be prepared to work in an extremely collaborative and high-energy environment.
TECHNICAL REQUIREMENTS FOR THE ROLE
- 7 or more years of experience with relevant internet technologies and with implementing, administering, and supporting production websites and backend support systems.
- Understanding of how to install and configure operating systems, with specific expertise in Linux and Windows Server.
- Experience with public and private cloud hosting services (AWS, Google Cloud, Azure) as well as familiarity with container computing (e.g. Docker, Mesos, Kubernetes).
- Knowledge of software development Continuous Integration (CI) pipelines (e.g. Jenkins, GitLab CI)
- Experience with Source Control Management systems (Git, Git LFS)
- Configuration management experience (e.g. Chef, Puppet, Ansible)
- Recognized as a subject matter expert on at least one OS and proficient in multiple operating systems, including OS performance monitoring, setup, configuration, tuning, and troubleshooting.
- Experience with MSBuild, CMake, and Windows Installers (e.g. Nullsoft)
- Experience with Visual Studio and VS Build Process
- Recognized as a subject matter expert on at least one web server and application server technology, including setup, configuration, performance monitoring, tuning, clustering, and debugging (e.g. JConsole).
- Able to implement existing base standards for new systems and/or applications with mentoring for the following:
o Site monitoring and instrumentation
o Application monitoring and instrumentation
o System monitoring and instrumentation
o Resilience and performance
- Able to diagnose simple to complex system problems.
- Understands internet technologies and network protocols, including HTTP, basic load balancing configurations, security zones, VIPs, etc.
- Understands application design and dependencies for the sites the team supports.
- Able to author tools and scripts to be used by others to automate repeatable production tasks in standard languages like Bash, Python, Go, and PowerShell.
- Advanced skills in at least one programming language such as Python, Ruby, Java, Go, or C++ and able to build unit test suites for all software being developed.
- Able to author test plans for use by peers and junior SREs.
- Able to perform and provide in-depth analysis of load test runs against a moderately complex system.
- Demonstrates exceptional troubleshooting methodology, including the ability to author new methodologies and teach them to the SRE team.
- Demonstrates the ability to independently triage moderately complex incidents.
- Independently resolve moderately to highly complex system and application incidents.
- Able to identify and propose system and application fixes for performance bottlenecks.
- Able to evaluate new application requirements for capacity and run-time best practices.
- Able to evaluate new system and/or infrastructure solutions for technical feasibility against known requirements and standards.
- Effective at dealing with change: Able to transition in role or handle a significant modification to workflow or technology with minimal ramp-up time and with very little guidance.
- Serves as primary point of contact with Manager.
- Demonstrates curiosity, continuous learning, and self-improvement.
- Ability to lead functional teams in systems integration and design including writing operational specs, architectural diagrams, test plans and requirements management.
- Communication of ideas and solutions in a clear and organized manner.
- Clear and effective presentations to groups of people.
- Effective project management and planning on large-scale projects (familiarity with agile/scrum project management a plus).
- Ability to design and deliver training to other staff.
- Construction of concise and complete technical documentation.
- Detailed understanding of the goals and requirements of the business supported.