Site Reliability Engineer
Quipper's SRE team was formed with the mission of building a self-sufficient development organization. We collaborate with the development teams to ensure stability, reliability, and a high-quality development experience for Quipper's learning services (Quipper Video, Quipper School), which are expanding further in Indonesia and the Philippines. The work involves not only building and monitoring infrastructure and responding to failures, but also, and especially, leveraging public cloud, SaaS, automation, and OSS so that these systems can run automatically within the current organization.
What makes this work appealing
Specifically, you will work with your team members and the development teams to devise, implement, and roll out concrete solutions to the following challenges:
- A program for learning the philosophy and skills of Site Reliability Engineering together with the development teams, to explore better ways of developing products
- A system to optimize the CI/CD pipeline and improve the development experience through GitOps tools, etc.
- A self-service mechanism for cloud resources such as those on AWS
- Planning and implementing autoscaling so that capacity scales flexibly with service growth and seasonality
- Simplification and automation of scaling through Cluster/Pod autoscaling
- Observability infrastructure using Envoy, OpenTelemetry, etc.
- Easy-to-use mechanisms, such as circuit breaking and rate limiting with Istio/Envoy, to improve the stability of microservices
- Gradual migration of job execution platforms such as Jenkins to a cloud-native setup that fits how the organization works today
- A system to efficiently develop applications on Kubernetes using Telepresence, etc.
- A system to easily collect common metrics for each language and framework using Prometheus Exporter, etc.
- Standardized log formats and libraries that make logs searchable across multiple applications
- Optimization of resources and costs through the use of Savings Plans, Spotinst, etc.
As an SRE, you will be responsible for understanding Quipper's server configuration and architecture, the development teams' challenges, and the problems the product needs to solve. We expect you to proactively propose solutions and engage in dialogue to solve problems, covering everything from implementation and building mechanisms to advocating for the solutions.
Requirements:
- Experience operating automated systems on a public cloud such as AWS using Infrastructure as Code tools
- Experience in operating web applications
- Experience programming in a language other than shell script (Go, Ruby, Python, etc.)
- Experience using Docker and other container-related technologies.
- Alignment with our missions of becoming "Distributors of Wisdom" and realizing a "Revolution in the Distribution of Knowledge", and with our engineering style
- Ability to communicate openly with everyone involved in product development, and to take on work beyond the boundaries of SRE, to make things better
Nice to have:
- Experience working on a DevOps team and developing products from both the Dev and Ops perspectives
- Experience defining SLIs/SLOs and building a culture of regularly reviewing them
- Experience in improving the developer experience by building CI/CD pipelines and development environments.
- Experience designing, implementing, and evolving architecture from both organizational and technical perspectives
- Expertise in reliably operating and managing distributed systems such as microservices
- Experience building observability infrastructure (logging, tracing, metrics, etc.)
- Experience designing cloud-native applications at your company
- Ability to choose the appropriate database (RDBMS, KVS, column-oriented DB, etc.) based on requirements and data characteristics
- Experience building analytics systems
- Programming experience in Ruby or Go
- English communication skills, or a desire to improve them
Technology and tools
- Database: MongoDB, Amazon Aurora (PostgreSQL/MySQL), BigQuery, Treasure Data
- Infrastructure: AWS, GCP, Kubernetes, CircleCI, GitHub Actions
- Monitoring: Datadog, New Relic, Pingdom, Sentry, Google Cloud Logging
- Communication: GitHub, Slack
Why apply to us?
- HMO coverage upon regularization, plus 3 additional dependents (fully covered by the Company)
- 12 days of vacation leave (up to 5 days can be carried over to the following year), unlimited sick leave for 1 year, and additional special leave per event (marriage, paternity, death, hospitalized family member)
- Company-issued equipment (laptop provided upon regularization)
- Promotion opportunities
- Opportunity to meet and train abroad
- Free location policy (support for working temporarily from global offices)
- Government mandated benefits
- Work conditions: remote work setup