Senior Engineer
Reltio
Bengaluru, Karnataka, India
At Reltio®, we believe data should fuel business success. Reltio’s AI-powered data unification and management capabilities—encompassing entity resolution, multi-domain master data management (MDM), and data products—transform siloed data from disparate sources into unified, trusted, and interoperable data. The Reltio Connected Data Platform™ delivers interoperable data where and when it's needed, empowering data and analytics leaders with unparalleled business responsiveness. Leading enterprise brands—across multiple industries around the globe—rely on our award-winning data unification and cloud-native MDM capabilities to improve efficiency, manage risk and drive growth.
At Reltio, our values guide everything we do. With an unyielding commitment to putting our “Customer First”, we strive to ensure our customers’ success. We embrace our differences and are “Better Together” as One Reltio. We are always looking to “Simplify and Share” our knowledge when we collaborate to remove obstacles for each other. We hold ourselves accountable for our actions and outcomes and strive for excellence. We “Own It”. Every day, we innovate and evolve, so that today is “Always Better Than Yesterday”. If you share and embody these values, we invite you to join our team at Reltio and contribute to our mission of excellence.
Reltio has earned numerous awards and top rankings for our technology, our culture and our people. Reltio was founded on a distributed workforce and offers flexible work arrangements to help our people manage their personal and professional lives. If you’re ready to work on unrivaled technology where your desire to be part of a collaborative team is met with a laser-focused mission to enable digital transformation with connected data, let’s talk!
Job Summary
We are seeking a Sr SDE – Reliability and Resiliency Engineering to drive platform reliability, resilience, and distributed systems robustness across the Reltio multi-cloud platform.
This is a hands-on engineering role focused on strengthening reliability practices, operationalizing chaos engineering, improving observability-driven validation, and embedding resilience into the software development lifecycle.
The ideal candidate has 5–8 years of experience building and operating distributed systems, good knowledge of multi-cloud environments (AWS, Azure, GCP) and Kubernetes-based microservices, and the ability to contribute to resilience-focused design and engineering practices.
Job Duties and Responsibilities
Reliability & Resilience Engineering
- Design and implement resilience strategies across AWS, Azure, and GCP environments.
- Support chaos engineering initiatives and integrate experiments into CI/CD pipelines.
- Validate RTO, RPO, SLO, and recovery objectives using metrics.
- Follow governance controls for chaos experiments.
- Identify and mitigate potential failure scenarios before production.
Distributed Systems & Platform Engineering
- Improve resilience across databases, caching, messaging, and auth services.
- Support multi-region and cross-cloud failover validation.
- Validate retry logic, circuit breakers, and graceful degradation.
- Reduce cascading failures and improve system stability.
Observability & Continuous Validation
- Use Grafana, Prometheus, and LogDNA for validation.
- Integrate reliability checks into Jenkins CI/CD pipelines.
- Ensure production readiness before releases.
- Use telemetry to identify improvement areas.
Collaboration & Engineering Excellence
- Work with SRE, platform, and dev teams to improve resilience.
- Contribute to best practices and documentation.
- Participate in incident analysis and improvements.
Skills You Must Have
Professional Experience
- 6–9 years of software engineering experience.
- Experience with distributed systems and microservices.
- Exposure to AWS, Azure, and GCP.
- Hands-on Kubernetes experience (EKS, AKS, GKE).
Technical Skills
- Strong Java and Spring Boot skills.
- Understanding of distributed system failures and resilience patterns.
- Exposure to chaos tools (Chaos Mesh, Gremlin, Harness, etc.).
- Knowledge of AWS, Azure, and GCP services.
- CI/CD experience (Jenkins preferred).
- Understanding of SLOs/SLIs.
- Experience with Grafana, Prometheus, and LogDNA.
Skills That Are Nice to Have
- Exposure to multi-region architecture.
- Service mesh familiarity.
- Performance testing exposure.
- Security tools like Snyk.
- Good communication skills.
What Success Looks Like in This Role
- Reliability checks integrated into CI/CD.
- Chaos testing improves resilience.
- Failures detected early.
- Improved system stability.
- Contribution to resilience-first culture.
Reltio is proud to be an equal opportunity workplace. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. Reltio is committed to working with and providing reasonable accommodation to applicants with physical and mental disabilities.