Site Reliability Engineer

Posted 27 January by Nominet

MAIN JOB PURPOSE

We are establishing a new, dedicated DevOps team in our Internet Engineering & Operations function, working closely with software developers and infrastructure engineers - all operating at the heart of the internet. Working within this DevOps team, you will be responsible for our application monitoring across the technical estate. Building on our existing monitoring framework, you will use your infrastructure and scripting experience to add bespoke and robust monitoring checks that continuously assess the health of our services and applications.

JOB SUMMARY

  • Take ownership of the monitoring of applications, services and infrastructure.
  • Write and maintain software and scripts that capture detailed heuristics about the health of applications and alert accordingly.
  • Design and implement monitoring checks for new services prior to launch.
  • Ensure consistent and thorough monitoring across all environments (development, beta, production, etc.).
  • Capture improvements to the logging platform, including integration with Logstash.
  • Expand the existing monitoring within Zabbix and investigate and prototype monitoring checks using alternative frameworks.
  • Integrate with 3rd-party APIs and services to export application log data for auditing purposes.
  • Work with DevOps Engineers, Sys Admins and Software Developers during software releases.
  • Write automated monitoring tests and integrate them into the CI/CD framework.
  • Be an ambassador for DevOps across the business, influencing others to embrace automation and DevOps principles.
  • Work with the Release Manager to ensure successful and streamlined production deployments.

KEY REQUIREMENTS

  • Background in software engineering (using languages such as Java, Python, etc.).
  • Knowledge of JMX and Java-based application monitoring.
  • Experience with Linux.
  • System design knowledge.
  • Experience monitoring Kubernetes clusters and pods.
  • Confident monitoring the health of servers (cloud-based and on-prem), including CPU, memory and storage.
  • Confident with Ansible, Terraform and Git.
  • Experience with AWS.
  • Network and security knowledge.
  • Experience deploying, managing and troubleshooting software applications (including Web Apps and B2B).
  • Comfortable working with Agile practices and Jira.
  • Knowledge of Zabbix and Logstash/ELK highly desirable.

Reference: 41322251

