Site Reliability Engineer

Posted 13 May by Altitude Angel

At Altitude Angel, we envisage a future in which drones revolutionise the lives of everyone on the planet, so we’re building the platforms that let drones fly safely alongside other, more traditional forms of aviation.

We are widely regarded as a technical leader in our field, and our national foundation technologies are deployed internationally by governments, civil aviation authorities and air navigation providers to meet the needs of an emerging industry.

Our platform is now used in 152 countries across the globe, enabling over 50,000 safe drone flights every month, from the transport of life-saving medication in sub-Saharan Africa to coffee and burger delivery in the Republic of Ireland.

We are seeking an experienced Site Reliability Engineer (SRE) to join our Product Engineering department and play a pivotal role in maintaining and enhancing the performance, reliability, and scalability of our systems. The ideal candidate will establish and own SLAs and SLOs, optimize resource usage, and proactively monitor system health. You will be instrumental in developing solutions that ensure seamless operations, cost-effective performance, and continuous service improvement.

What You Will Do

  • Establish service level objectives (SLOs) and service level agreements (SLAs)
  • Optimise system performance and scalability through effective resource management, load distribution, and latency reduction
  • Develop proactive monitoring solutions and dashboards to provide visibility into system health and performance while alerting on potential issues
  • Ensure services operate within defined budget constraints, identifying opportunities for cost-saving and optimization
  • Create and maintain comprehensive documentation for system architecture, configuration, and troubleshooting procedures
  • Partner with development teams to ensure new features and enhancements meet reliability and performance standards
  • Conduct root cause analysis post-incident and implement preventive measures to avoid future occurrences
  • Automate repetitive tasks and processes to enhance efficiency and minimize manual intervention
  • Stay informed about industry best practices, emerging trends, and new technologies in site reliability engineering
  • Identify technical debt and collaborate with application teams to establish remediation plans
  • Deliver continuous service improvement through Infrastructure as Code development

Secondary responsibilities:

  • Perform daily health and compliance checks on systems as required
  • Validate and promptly resolve monitoring alerts and batch job failures
  • Ensure sufficient capacity is available to support growth
  • Respond promptly to emails sent to team distribution lists or mailboxes
  • Handle incidents and requests efficiently, with a "customer-first" mindset
  • Maintain highly available, reliable, secure, and performant infrastructure
  • Conduct general server, database, and virtualization administration and maintenance activities
  • Provide technical support and consultation to application support and development teams


Key Requirements

Essential:

  • Proficiency in deploying, scaling, and managing containerized applications with Docker/Kubernetes
  • Expertise in managing and optimizing monitoring stacks like Grafana, Prometheus or Azure Monitor
  • Strong skills in creating dashboards using PromQL or KQL
  • Experience with CI/CD/CT platforms such as Azure DevOps
  • Familiarity with "Infrastructure as Code" and "Continuous Integration and Continuous Delivery" principles and practices
  • Knowledge of Agile, Site Reliability Engineering (SRE), and DevOps principles and practices
  • Proficiency in scripting and programming languages such as PowerShell, Python, Bash, and C#
  • Knowledge of backup and recovery processes and procedures
  • Advanced understanding of clustering, high availability, replication, and disaster recovery techniques
  • Strong ability to tune network, storage, server, and virtualization layers for optimal performance and reliability
  • In-depth performance tuning skills and system internals knowledge
  • Experience implementing CIS security hardening recommendations

About you

  • Excellent communication and interpersonal skills
  • Ability to handle pressure during outages and systematically resolve issues
  • Strong problem-solving abilities
  • Results-driven with a strong sense of accountability
  • Proactive and motivated approach
  • Ability to work with urgency and prioritise work effectively
  • Structured and logical work approach
  • Attention to detail and accuracy
  • Proficiency in managing constructive conflict
  • Ability to communicate complex technical concepts to non-technical audiences

What’s in it for you?

At Altitude Angel, we’re dedicated to creating a positive working environment that empowers our employees to be the best version of themselves. Through personalised Career Development Plans and generous work-life balance considerations, we support our team to work in the way that suits them best.

Equality, Diversity and Inclusion at Altitude Angel

Here at Altitude Angel, we are committed to cultivating and preserving a culture of inclusion and connectedness in line with our mission to safely and securely open the skies for all. We grow and learn better together with a diverse team of employees. The collective sum of the individual differences, life experiences, knowledge, innovation, self-expression, and talent that our employees invest in their work represents not only part of our culture, but also our reputation, our products and Altitude Angel’s achievements. In recruiting for our team, we welcome the unique contributions that you can bring in terms of education, opinions, culture, ethnicity, race, sex, gender identity and expression, nation of origin, age, languages spoken, veteran status, colour, religion, disability, sexual orientation and beliefs.

Reference: 52644827
