System Reliability Engineer (SRE)
Location: Hastings (up to one work-from-home day per week)
Contract Term: 12-month contract
Start Date: From May 2025
About the Role: Atlas Recruitment is partnering with a leading manufacturing company based in Hastings to recruit an experienced System Reliability Engineer (SRE). This key position focuses on ensuring the stability, performance, and reliability of critical IT and OT systems within the organisation. The ideal candidate will be proactive in monitoring system health, responding to incidents rapidly, performing root cause analysis, and contributing to the continuous improvement of system performance and operational efficiency.

As an SRE, you will work closely with both development and operations teams to ensure that reliability standards are defined, implemented, and maintained. This role offers the opportunity to work in a dynamic, fast-paced environment and to make a tangible contribution to the health of the organisation's systems.

Key Responsibilities:

System Monitoring & Support: Monitor and support the health, availability, and performance of business-critical IT/OT systems.
System Improvement: Implement small-step, low-risk improvements to enhance system reliability and reduce technical debt.
Automation & Incident Response: Automate monitoring, alerting, and incident response processes to enhance operational efficiency.
Collaboration: Work closely with development and operations teams to define, enforce, and refine reliability standards.
Root Cause Analysis: Perform root cause analysis and contribute to post-incident reviews, ensuring long-term fixes are implemented.
Infrastructure as Code: Maintain infrastructure as code and contribute to continuous integration/continuous deployment (CI/CD) improvements.

Technical Skills Required:

Proficiency with C#, SQL, and Azure Web Services to ensure efficient system management and troubleshooting.
Experience with observability tools such as Zabbix, Prometheus, and Grafana to monitor system performance.
Familiarity with Docker and containerised environments for application deployment and management.
Exposure to legacy systems such as VB6 is an advantage, as the company still runs legacy applications in its tech stack.

Additional Tools & Practices:

Experience with Jira, Confluence, and Scrum to support project management and communication across teams.
Incident Management Tools: Familiarity with Opsgenie or similar tools for effective incident response.
Understanding of SRE concepts such as SLIs (Service Level Indicators), SLOs (Service Level Objectives), error budgets, and operational runbooks to maintain system reliability.
Strong Communication & Proactive Mindset: Clear communication skills and the ability to take ownership of system health, working independently while collaborating effectively with others.

Qualifications & Experience:

A Bachelor’s or Master’s degree in IT, Computer Science, or a related field.
5+ years of relevant experience in systems engineering, DevOps, or SRE roles.
A strong track record in system reliability engineering or a similar field within the manufacturing or IT sectors.

Why Join Us? This is a fantastic opportunity to join a well-established manufacturing company that is committed to maintaining cutting-edge IT/OT systems. As part of the team, you will be at the forefront of driving system improvements and helping maintain a high standard of operational reliability.
Timeframe: This is a 12-month contract position with a 3-month trial period. There is potential for extension based on business needs and performance.
To Apply: If you’re an experienced System Reliability Engineer looking for an exciting challenge in a dynamic, supportive environment, we encourage you to apply. Please submit your updated resume along with a cover letter outlining your relevant experience and why you’re a strong fit for this role.