Herzliya, Israel
9 hours ago
Sr. Software Engineer (with Experience in SRE)

Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership.

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Digital & Platform Services team, you are the non-functional requirement owner and champion for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact. You create and drive SRE culture, mindsets, and behavior by implementing industry best practices and by providing SRE teams with consulting, coaching, and a data-driven view of how they are performing and where they can improve.

Job responsibilities

- Demonstrate expertise in site reliability principles and an understanding of the fine balance between features, efficiency, and stability.
- Act as the main point of contact during major incidents, applying technical expertise to quickly identify and solve issues, while documenting and sharing knowledge within the organization.
- Evolve and debug critical components of applications and platforms.
- Manage incidents effectively and strive to improve Mean Time to Recovery (MTTR) and other MTTx metrics through proactive monitoring and response strategies.
- Implement and maintain observability tools, including Dynatrace, Grafana, Splunk, CloudWatch, and Datadog, to monitor system performance and reliability, visualize the resulting metrics, and set up dashboards for real-time monitoring.
- Deploy and oversee services within cloud environments, with a strong preference for AWS, ensuring optimal performance and cost-efficiency.
- Drive continuous improvement of reliability, monitoring, and alerting for mission-critical microservices, while reducing toil through automation and creating reliable infrastructure and tooling to expedite feature development.
- Develop and implement metrics for microservices; define user journeys, SLOs, and error budgets; configure dashboards and alerts; and facilitate blameless post-mortems to ensure permanent incident closure.
- Engage with development teams throughout the software lifecycle to enhance reliability and scale, design self-healing and resiliency patterns, and implement infrastructure, configuration, and network as code.
- Collaborate with software engineers and teams to design and implement deployment approaches using automated CI/CD pipelines, supporting the adoption of site reliability engineering best practices.
- Implement and regularly test disaster recovery (DR) strategies to ensure the highest level of resilience and fault tolerance of the platform.
- Maintain and promote high-quality written documentation of the assets, processes, and runbooks that the team uses in its day-to-day operations.
- Negotiate effectively with peers and executive partners to ensure optimal outcomes for all, and drive the adoption of site reliability practices throughout the organization.
- Ensure your teams follow site reliability best practices and can demonstrate this empirically through stability and reliability metrics.
- Drive a culture of continual improvement and solicit real-time feedback to improve the developer and client experience.
- Ensure your team collaborates with other teams within your group’s specialization and avoids duplication of work where possible.
- Follow blameless, data-driven post-mortem strategies and conduct regular team debriefs to enable learning from both successes and mistakes.
- Provide personalized coaching for entry- to mid-level team members.
- Ensure your team documents and shares its knowledge and innovations via internal forums, communities of practice, guilds, and conferences.

 

Required qualifications, capabilities, and skills

- Formal training or certification on software engineering concepts and 15+ years of applied experience.
- Advanced knowledge of site reliability culture and principles, with demonstrated ability to implement site reliability within an application or platform.
- Proficiency in at least one programming language, such as Python, Go, or Java/Spring Boot, with expertise in designing, coding, testing, and delivering software.
- Proven experience deploying services to cloud platforms, preferably AWS.
- Understanding of and working experience with AWS applications, and an understanding of resiliency, scalability, observability, and monitoring.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker).
- Proficiency and experience in observability, such as white-box and black-box monitoring and SLO alerting, and comprehensive experience with observability tools, specifically Dynatrace, Grafana, Splunk, CloudWatch, and Datadog.
- Experience in incident management and improving MTTx metrics.
- Hands-on experience with relational databases such as Oracle or MySQL.
- Strong problem-solving skills with a high level of accountability.
- Excellent written and verbal communication skills.
- Ability to work independently and manage ambiguous scopes effectively.
- Experience troubleshooting common networking technologies and issues.
- Ability to identify and solve problems related to complex data structures and algorithms.
- Ability to expand and collaborate across different levels and stakeholder groups.

 

Preferred qualifications, capabilities, and skills

- Experience in the banking/financial domain.
- Experience in infrastructure architecture design.
- Familiarity with GCP, AWS ECS, EKS, and Terraform.
- Expertise in networking concepts (TCP/IP, DNS, TLS, HTTPS) and CDN technologies.

 

 

 

 
