Design and implement observability frameworks: Integrate and manage monitoring, logging, and tracing for cloud-native and on-premises systems.
Data integration and ingestion: Build systems to collect and ingest data from various sources, often through APIs, and manage time-series data at scale.
Develop custom solutions: Create custom dashboards, alerts, and reporting views to provide clear, actionable insights into system performance.
Build status pages: Develop and maintain status pages that provide a user-friendly web experience to communicate system health to stakeholders.
Tooling expertise: Leverage expertise in tools like Prometheus and Grafana for time-series data collection and visualization, and Dynatrace for deep-dive analysis and monitoring.
Collaboration and evangelism: Work with development, operations, and SRE teams to identify key performance metrics (e.g., latency, error rates).
Performance and reliability: Use data-driven insights to improve system reliability, reduce Mean Time To Resolution (MTTR), and optimize resource usage.
Required Skills and Qualifications:
Technical Skills:
Proficiency in Prometheus, Grafana, and Dynatrace.
Experience with time-series data, including PromQL.
Strong knowledge of API design, data modeling, and data pipelines.
Expertise in scripting and backend development (e.g., Python, Go, Java).
Experience:
Proven experience designing and scaling observability stacks for production systems.
Hands-on experience with cloud platforms (AWS, Azure).
Experience with containers and container orchestration (e.g., Kubernetes) is often required.
Familiarity with dashboards as code and Terraform.
Soft Skills:
Excellent problem-solving, communication, and stakeholder management skills.
Ability to translate complex data into actionable recommendations for technical and non-technical audiences.