System Analyst II - Site Reliability Engineer
Durham, NC, US, 27710
At Duke Health, we're driven by a commitment to compassionate care that changes the lives of patients, their loved ones, and the greater community. No matter where your talents lie, join us and discover how we can advance health together.
About Duke Health Technology Solutions
Pursue your passion for caring and innovation with Duke Health Technology Solutions, which is dedicated to the transformation, development, and management of enterprise information technology solutions across Duke Health. By harnessing the power of innovative technologies like cloud computing and artificial intelligence, and pairing them with a forward-thinking approach, Duke Health Technology Solutions is revolutionizing the future of health care at Duke Health and beyond.
Occupational Summary
The DHTS Systems Analyst-Site Reliability Engineer (SRE) is responsible for designing, implementing, and maintaining large-scale distributed systems with a focus on reliability, scalability, and performance. The SRE collaborates with development teams to ensure that applications and services are designed and operated to meet reliability targets and scale efficiently. This role involves working with OpenShift for on-premises environments and Azure Kubernetes Service (AKS) for cloud-based solutions.
Essential Tasks/Responsibilities
Level 1 (DHTS System Analyst 1)
• Under direct supervision, assist in monitoring and maintaining production systems to ensure high availability and performance, including OpenShift clusters on-premises and AKS in the cloud.
• Participate in on-call rotations to respond to system alerts and incidents.
• Assist in troubleshooting and resolving system issues and outages across both on-premises and cloud environments.
• Help implement and maintain automation scripts for routine tasks and deployments in OpenShift and AKS.
• Contribute to the creation and maintenance of documentation for systems and processes.
• Assist in capacity planning and performance tuning of systems in both OpenShift and AKS environments.
• Participate in post-incident reviews and help implement recommendations.
• Learn and apply SRE best practices and methodologies specific to container orchestration platforms.
• Collaborate with development teams to improve system reliability and efficiency across on-premises and cloud infrastructures.
Level 2 (DHTS System Analyst 2)
In addition to the duties described for Level 1, the Level 2 SRE will:
• Independently design and implement monitoring solutions for complex systems in OpenShift
and AKS environments.
• Lead incident response efforts and coordinate with multiple teams during outages, considering
the nuances of both on-premises and cloud infrastructures.
• Develop and implement automation solutions to improve system reliability and efficiency across
OpenShift and AKS platforms.
• Conduct thorough root cause analysis for incidents and propose long-term solutions that align
with the organization's hybrid infrastructure strategy.
• Contribute to the design and implementation of disaster recovery and business continuity plans,
leveraging both on-premises and cloud resources.
• Mentor junior team members and provide technical guidance on OpenShift and AKS best
practices.
• Participate in the evaluation and implementation of new technologies and tools that
complement OpenShift and AKS environments.
• Collaborate with development teams to define and implement SLIs, SLOs, and SLAs across both
platforms.
• Contribute to the development of architectural improvements to enhance system reliability and
scalability in a hybrid infrastructure model.
Level 3 (DHTS System Analyst 3)
In addition to the duties described for Level 2, the Level 3 SRE will:
• Function as a technical leader and subject matter expert in reliability engineering, with deep
expertise in both OpenShift and AKS.
• Lead the design and implementation of large-scale, complex distributed systems across
on-premises OpenShift and cloud-based AKS environments.
• Develop and implement strategies for continual improvement of system reliability,
performance, and efficiency in a hybrid infrastructure model.
• Lead cross-functional projects to improve overall system architecture and reliability, considering
the strengths and limitations of both OpenShift and AKS.
• Provide advanced troubleshooting and problem-solving for critical production issues in both
on-premises and cloud environments.
• Develop and maintain relationships with key stakeholders across the organization to align SRE
practices with business objectives.
• Drive the adoption of SRE best practices and methodologies across the organization, tailored to
the specific needs of OpenShift and AKS platforms.
• Contribute to the definition of technical standards and best practices for the SRE team, ensuring
consistency across on-premises and cloud environments.
• Mentor and provide technical leadership to junior and mid-level SREs in both OpenShift and AKS
technologies.
• Participate in strategic planning for infrastructure and reliability improvements, considering the
long-term evolution of the hybrid infrastructure model.
• Represent the SRE team in high-level technical discussions and decision-making processes
related to container orchestration and cloud strategy.
Advancement to the next level requires that the employee, at a minimum, successfully attain the following:
1. Proven ability to work at the next level: This involves demonstrating the skills and competencies
required for the next level of responsibility. Employees should have demonstrated that they can
handle tasks and challenges that are typically associated with the higher position.
2. Potential to serve beyond the next level: This measure looks at the employee's long-term
potential and their ability to grow within the organization. The employee should have the vision,
ambition, and capability to take on even greater responsibilities in the future.
3. Consistently demonstrates a values-based approach in how they work: Employees should
consistently exhibit behaviors and decision-making processes that align with DUHS values. The
exhibited values are integrity, teamwork, diversity, excellence, and safety. A patient-focused
approach is also critical to success.
4. Is considered one of the top performers at their level across the organization: This measure
evaluates the employee's overall performance and reputation within DHTS. Top performers are
often recognized for their exceptional contributions, reliability, and ability to exceed expectations.
We will select the best candidate, not merely the best available.
Required Qualifications at this Level
Education
Bachelor's degree in a related field is preferred, or equivalent work experience.
Experience
• Level 1 (DHTS System Analyst 1): 0-4 years of software development experience and/or IT
solutions engineering.
• Level 2 (DHTS System Analyst 2): Minimum 5 years of software development experience and/or
IT solutions engineering.
• Level 3 (DHTS System Analyst 3): Minimum 10 years of software development experience
and/or IT solutions engineering.
Required Skills and Knowledge
Level 1 (DHTS System Analyst 1)
• Basic understanding of Application Development Lifecycle, ideally with DevOps focus
• Familiarity with script writing (e.g., Ansible Playbooks, Helm Charts)
• Basic knowledge of containerization and orchestration technologies (Docker, Kubernetes,
OpenShift)
• Familiarity with CI/CD technologies like GitLab CI or GitHub Actions
• Basic understanding of server administration (preferably Linux)
• Understanding of networking topologies, firewall rules, and certificate management
• Ability to analyze customer requirements and translate into effective solutions
• Critical thinking and problem-solving skills
• Strong customer service orientation
• Basic troubleshooting and root cause analysis skills
• Familiarity with project management and Agile/Scrum methodologies
• Proficiency in at least one programming language (e.g., Python, Go, Java)
• Familiarity with version control systems (e.g., Git)
Level 2 (DHTS System Analyst 2)
All Level 1 skills, plus:
• Strong experience with Application Development Lifecycle, with a DevOps focus
• Proficiency in script writing (e.g., Ansible Playbooks, Helm Charts)
• Extensive experience with containerization and orchestration technologies (Docker, Kubernetes,
OpenShift)
• Strong experience with CI/CD technologies and practices
• Advanced knowledge of server administration (preferably Linux)
• Solid understanding of networking topologies, firewall rules, and certificate management
• Proven ability to analyze complex customer requirements and translate into effective solutions
• Advanced troubleshooting and root cause analysis skills
• Strong project management skills, including Agile/Scrum experience
• Experience with cloud platforms (AWS, Azure, GCP) and services (SaaS, IaaS, PaaS, FaaS)
• Knowledge of Enterprise Architecture best practices
• Familiarity with AI and ML concepts
Level 3 (DHTS System Analyst 3)
All Level 2 skills, plus:
• Technical leadership in application development with a DevOps/CI focus
• Technical leadership in automation (Ansible, Terraform, Bash)
• Extensive experience with Continuous Integration / Continuous Delivery
• Extensive experience with server administration
• Expert knowledge of network and security concepts
• Proven ability to lead and mentor teams in adopting and optimizing container orchestration
practices
• Expert knowledge of cloud platforms (AWS, Azure, GCP) and services (SaaS, IaaS, PaaS, FaaS)
• Expert knowledge of Enterprise Architecture best practices
• Advanced knowledge of AI and ML concepts and their application in SRE practices
Desired Skills (All Levels)
• Red Hat OpenShift certifications
• CKA (Certified Kubernetes Administrator) or CKAD (Certified Kubernetes Application Developer)
certifications
• Experience with multi-cloud environments
• Knowledge of FHIR APIs and healthcare-specific technologies
• Excellent time management, organizational, and task prioritization skills
• Strong presentation skills
• Ability to communicate effectively with non-technical staff and members of interdisciplinary
teams
• Ability to interact well and effectively communicate with all levels of leadership
• Experience with data and system flow diagramming
• Familiarity with vulnerability management and patching for application containers
Additional Responsibilities (All Levels)
• Provide application system support for team apps, including rotating 24x7 support
• Develop relationships with vendors to ensure customer needs are met in a timely manner
• Author and update system documentation to share all knowledge acquired in the developer
guide
• Ensure systems conform to Duke Information Security Office policies and procedures
• Assist in oral and written presentations to project teams, customers, and management
• Coordinate and perform application testing
• Follow established Change Management processes
• Provide feedback on departmental processes and procedures and suggest improvements
• Plan and coordinate system and application upgrades
• Identify internal resources to build project teams as required
• Perform detailed analysis and documentation of customer workflows
• Collaborate with Administrative, Clinical, and Research customers to understand and meet
needs
• Develop relationships with key customer management representatives
Intent:
The intent of this job description is to provide a representative summary of the types and level of duties and
responsibilities that will be required of positions given this title and shall not be construed as a
declaration of the total of the specific duties and responsibilities of any particular position. Employees
may be directed to perform job-related tasks other than those specifically presented in this description.
Equal Opportunity:
Duke University is an Affirmative Action/Equal Opportunity Employer committed to providing
employment opportunity without regard to an individual's age, color, disability, gender, gender
expression, gender identity, genetic information, national origin, race, religion, sex, sexual orientation,
or veteran status.
Duke aspires to create a community built on collaboration, innovation, creativity, and belonging. Our
collective success depends on the robust exchange of ideas, an exchange that is best when the rich
diversity of our perspectives, backgrounds, and experiences flourishes. To achieve this exchange, it is
essential that all members of the community feel secure and welcome, that the contributions of all
individuals are respected, and that all voices are heard. All members of our community have a
responsibility to uphold these values.
Essential Job Function:
Certain jobs at Duke University and Duke University Health System may include essential job functions
that require specific physical and/or mental abilities. Additional information and provision for requests
for reasonable accommodation will be provided by each hiring department.
Duke is an Equal Opportunity Employer committed to providing employment opportunity without regard to an individual's age, color, disability, gender, gender expression, gender identity, genetic information, national origin, race, religion, sex (including pregnancy and pregnancy related conditions), sexual orientation or military status.