Site Reliability Engineer
About the SRE Position
At Quipper, we are gradually transitioning from a monolithic architecture to microservices. This is not only a technical challenge but also an organizational one: the goal is to move product development forward and explore better ways of providing education. As a result, we feel the importance of DevOps/SRE practices with real urgency.
In the transition to microservices, a major challenge currently facing the entire development organization, including the SRE team, is making development teams self-contained. When the same service is shared by multiple teams, releasing a feature requires complex coordination. Likewise, if creating a new service requires coordinating with the infrastructure team to set up the necessary infrastructure, that coordination can become a bottleneck.
This is where the self-contained team comes in. If each team owns a service that matches its scope of responsibility, the cost of coordination can be reduced to as close to zero as possible. The team can then repeat the cycle of developing the features we need many times at low cost, and through that iteration we can build products that better fit what users need.
In addition, if development teams can select the technology for the infrastructure layer themselves, they can quickly evaluate the best options for the functions they need. In such an organization, the SRE team is expected to make the product better by empowering the development organization, rather than just fulfilling requests.
- Leveraging Public Cloud and SaaS
  - Quipper's services are currently built on AWS and GCP.
  - For the non-core parts of the service, we proactively use SaaS.
    - Datadog for monitoring, SendGrid for email, Sentry for error management, etc.
- Infrastructure as Code
  - Infrastructure is codified and automated using tools such as Terraform and Ansible whenever possible.
- Leveraging OSS
  - Quipper's services are supported by many OSS projects, including Kubernetes, Envoy, Argo CD, Ruby, Ruby on Rails, Go, gRPC, PostgreSQL, MongoDB, and Fluentd.
  - We encourage not only using OSS but also actively contributing by reporting issues and opening Pull Requests as needed.
- Writing the code
  - If an existing solution does the job without writing code, we use it; if building our own makes more sense, we build it.
Specifically, you will work with your team members and the development teams to design, implement, and roll out concrete solutions for the following issues.
- A program for learning the philosophy and skills of Site Reliability Engineering together with the development team, to explore better ways of product development.
- A system to optimize the CI/CD pipeline and improve the development experience through GitOps tools, etc.
- A mechanism for self-service of cloud resources such as AWS
- Planning and application of autoscaling for flexible scaling according to service growth and seasonality
- Simplification and automation of scaling through Cluster/Pod autoscaling
- Observability infrastructure using Envoy, OpenTelemetry, etc.
- An easy-to-use mechanism for improving the stability of microservices, such as circuit breaking and rate limiting, using Istio/Envoy, etc.
- Gradual migration of job execution platforms such as Jenkins to a Cloud Native form that matches the current form of the organization.
- A system to efficiently develop applications on Kubernetes using Telepresence, etc.
- A system to easily collect common metrics for each language and framework using Prometheus Exporter, etc.
- Preparation of log formats and libraries for searching logs across multiple applications.
- Optimization of resources and costs through the use of Savings Plans, Spotinst, etc.
Technology and tools:
- Database: MongoDB, Amazon Aurora (PostgreSQL/MySQL), BigQuery, Treasure Data
- Infrastructure: AWS, GCP, Kubernetes, CircleCI, GitHub Actions
- Monitoring: Datadog, New Relic, Pingdom, Sentry, Google Cloud Logging
- Communication: GitHub, Slack
The SRE will be responsible for understanding Quipper's server configuration and architecture, the issues the development teams face, and the problems the product needs to solve. We expect you to proactively make proposals and engage in dialogue to solve problems, covering everything from implementation and building the supporting mechanisms to advocating for the solutions.
After that, depending on your interests and performance, you may have the opportunity to train and mentor other members or to become an engineering manager.
Requirements:
- Experience in operating an automation system on a public cloud such as AWS through Infrastructure as Code tools
- Experience in operating web applications
- Experience programming in a language other than shell script (Go, Ruby, Python, etc.)
- Experience using Docker and other container-related technologies.
- Belief in the mission of realizing "Distributors of Wisdom" and "Revolution in the Distribution of Knowledge", and alignment with our engineering style.
- Able to communicate collaboratively with everyone involved in product development, including on work that goes beyond SRE, in order to make things better
- If Japanese is not your native language, you should have Japanese language skills at JLPT N2 level or higher.
Nice to have:
- Experience working on a DevOps team and developing products from both Dev and Ops perspectives.
- Experience establishing SLIs/SLOs and building a culture of reviewing them.
- Experience in improving the developer experience by building CI/CD pipelines and development environments.
- Experience in designing, implementing, and evolving architecture from both organizational and technical perspectives
- Expertise in stable operation and management of distributed systems such as Microservices.
- Experience in building infrastructures to ensure Observability such as Logging, Tracing, Metrics, etc.
- Experience in designing Cloud Native applications at your company
- Able to select the appropriate database from options such as RDBMS, KVS, and column-oriented DBs according to the requirements and characteristics of the workload
- Experience in building analytics systems
- Programming experience in Ruby or Go
- English communication skills, or a desire to improve them
Why apply to us?
- HMO coverage upon regularization, including 1 additional dependent (fully covered by the Company)
- 10 VL and 10 SL (unused SL is convertible to cash annually)
- Company-issued laptop
- Promotion opportunities
- Opportunity to meet and train abroad
- Free location policy (support for working temporarily from global offices)
- Government-mandated benefits
- [Current] Remote work setup