Description:
In this role, you will provide technology-based services incorporating strategy, design, development, delivery, and maintenance of technological solutions to meet business needs.
Be part of a team that…
- provides low-latency, high-throughput authorisation services to NAB systems, enabling our millions of customers and a plethora of internal teams to operate services safely and reliably at scale.
- With scale come exciting challenges in our day-to-day work.
- We love solving problems with code (reliability-as-code, observability-as-code, x-as-code), and given the high usage and scale of our services, there will be plenty of areas to grow and explore.
In this role you will:
- Define and drive Infrastructure as Code (IaC) patterns and the automation approach for changes and environment maintenance.
- Find and fix problems: maintain end-to-end solution oversight and proactively prevent incidents in Production through the use of Machine Learning and AI technologies.
- Apply in-depth knowledge of monitoring and alerting for Production stacks.
- Manage stakeholders and improve incident triage processes by liaising with Business, Service Owners, Service Delivery and Global Operations Centre.
- Identify shared engineering problems/gaps that have not been solved and work with NAB Engineering Framework (NEF) to develop/contribute those patterns to the wider NAB community.
- Support compliance, governance and vulnerability management for the services, platform, and cloud infrastructure.
- Guide Infrastructure capacity planning activities to ensure safe usage and scaling of services.
- Work with Engineering Managers and Scrum Masters to provide input into agile planning.
- Apply knowledge of algorithms and data structures.
Continuous improvement (Passion for customers)
- Maintain and improve existing code repos and peer-review code changes, while also acting as a code owner for any infra changes.
- Continually seek out relevant industry and technical knowledge to improve your professional skills.
- Identify cloud cost optimisations and compliance actions to help the team become more productive and efficient overall.
What You’ll Bring…
Essential skills:
- Proven Site Reliability Engineering experience delivering operational support for high-concurrency, low-latency workloads.
- Experience in Information Technology coupled with tertiary qualifications such as BSc/BE in Computer Science or a related discipline.
- Proficiency in languages and automation tools such as Python and Ansible.
- Experience in at least one major cloud provider such as AWS/Azure/GCP.
- Familiarity with various operating systems (Linux, macOS, Windows).
- Proficiency with containerisation technologies (Docker/Kubernetes).
- Experience with OpenTelemetry (OTel) alongside system monitoring tools (e.g. AppDynamics, Splunk) and open-source performance testing frameworks (e.g. JMeter, Karate).
- Strong analytical and reasoning skills with an ability to visualise processes and outcomes.
- Outstanding all-round communication skills and ability to work collaboratively.
- Ability to provide 24x7 escalation support for critical production systems, promptly resolving incidents and outages to minimise downtime and maintain reliability, availability, and performance.
- Experience with ITSM systems such as ServiceNow.