Leura: An Operation Automation System for SRE
This article gives a high-level description of the architecture of project Leura, explains its design philosophy, and offers suggestions for its design and development process.
Please be aware that this is the raw version of the initial architectural design, and the content may be outdated, erroneous, or inaccurate.
Short Description
Leura is an open-source operation automation system tailored for SRE use.
Why Is Automation Necessary for SRE?
A good operation automation system is essential for SRE because it enhances efficiency, consistency, and scalability in managing large-scale systems.
Automation reduces the risk of human error by standardising repetitive tasks, ensuring they are executed correctly, reliably, and uniformly. This lets the SRE team focus on higher-level problems rather than manual, trivial work.
What’s more, automation enables rapid, automated responses to incidents and failures, improving system reliability, which is critical for maintaining SLOs and ensuring a good user experience.
Requirements
This is the very first version of the requirement analysis for this system, and it may not be accurate. Another document will be released in the near future to reflect the newest requirements of Leura.
Functional Requirements (FRs)
- The system must provide functions that allow the administrator to control access to the system for multiple users.
  - The system must provide a function to manage multiple users.
  - The system must provide a function to manage user groups.
  - The system must provide functions to create, grant, and revoke permissions for multiple users and user groups.
  - The system must provide functions to manage access to services within the system with different permissions.
- The system must allow users to manage steps.
  - The system must provide functions to create script execution steps.
  - The system must provide functions to create file transfer steps.
  - (Optional) The system should support customised steps.
  - (Optional) The system should support step execution using different accounts.
- The system must provide functions to manage and execute tasks.
  - The system must provide functions to create, edit, search, and delete tasks.
  - The system must allow users to combine steps into tasks.
  - (Optional) The system should support task execution on servers in different subnets.
- The system must provide functions to save logs generated by steps.
  - The system must pack logs into .tar.gz files and save them in file systems.
  - The system must save a specific amount of logs in MongoDB for preview.
- The system must provide different kinds of APIs for access.
  - The system must provide RESTful APIs.
  - (Optional) The system should provide gRPC APIs.
- The system must be simple to deploy.
  - The system should be containerisation friendly.
  - The system should minimise components to reduce complexity.
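The log requirements above (packing step logs into .tar.gz files and keeping a limited amount for preview) could be sketched in Go roughly as follows. This is a minimal sketch, not Leura's actual implementation: the names `PackLogs`, `Preview`, and `ListArchive`, the in-memory map of logs, and the byte-based preview limit are assumptions made for illustration, and the writes to MongoDB and the file system are left out.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sort"
)

// PackLogs writes step logs (file name -> content) into a gzip-compressed
// tar archive and returns the archive bytes. Hypothetical helper.
func PackLogs(logs map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)

	// Sort the names so the archive layout is deterministic.
	names := make([]string, 0, len(logs))
	for name := range logs {
		names = append(names, name)
	}
	sort.Strings(names)

	for _, name := range names {
		data := logs[name]
		hdr := &tar.Header{Name: name, Mode: 0644, Size: int64(len(data))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(data); err != nil {
			return nil, err
		}
	}
	// Close the tar writer before the gzip writer so both trailers are flushed.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// Preview returns at most limit bytes of a log, standing in for the part
// that would be stored in the database for preview (an assumed policy).
func Preview(log []byte, limit int) []byte {
	if len(log) <= limit {
		return log
	}
	return log[:limit]
}

// ListArchive returns the entry names stored in a .tar.gz archive.
func ListArchive(archive []byte) ([]string, error) {
	gz, err := gzip.NewReader(bytes.NewReader(archive))
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	var names []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
	return names, nil
}

func main() {
	logs := map[string][]byte{
		"step-1.log": []byte("installing packages\n"),
		"step-2.log": []byte("restarting service\n"),
	}
	archive, err := PackLogs(logs)
	if err != nil {
		panic(err)
	}
	names, err := ListArchive(archive)
	if err != nil {
		panic(err)
	}
	fmt.Println(names, string(Preview(logs["step-1.log"], 10)))
}
```

Building the archive in memory keeps the sketch short; for very large logs, streaming to an `io.Writer` would avoid holding the whole archive in RAM.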
Non-Functional Requirements (NFRs)
- Performance:
  - The system should be able to maintain performance on limited resources (such as AWS EC2 micro instances).
  - The system must minimise resource consumption to avoid competing for resources with applications deployed on the same server.
- Security:
  - A user must not be able to read data without permission.
- Scalability:
  - The system should be able to easily scale up to support large-scale clusters.
- Fault tolerance:
  - The system must be able to recover from faults without human intervention.
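The security requirement, together with the user, group, and permission functions listed earlier, suggests a deny-by-default access check. The Go sketch below is one possible shape for it under stated assumptions: the `AccessControl` type, its method names, and permission strings such as "task:read" are hypothetical and do not come from the Leura design.

```go
package main

import "fmt"

// Permission names an action on a resource, e.g. "task:read".
// The naming scheme is an assumption made for this sketch.
type Permission string

// AccessControl keeps grants for users and user groups in memory.
type AccessControl struct {
	userGroups map[string][]string
	userPerms  map[string]map[Permission]bool
	groupPerms map[string]map[Permission]bool
}

func NewAccessControl() *AccessControl {
	return &AccessControl{
		userGroups: make(map[string][]string),
		userPerms:  make(map[string]map[Permission]bool),
		groupPerms: make(map[string]map[Permission]bool),
	}
}

// AddUserToGroup records that user is a member of group.
func (ac *AccessControl) AddUserToGroup(user, group string) {
	ac.userGroups[user] = append(ac.userGroups[user], group)
}

// GrantUser gives a permission directly to a user.
func (ac *AccessControl) GrantUser(user string, p Permission) {
	if ac.userPerms[user] == nil {
		ac.userPerms[user] = make(map[Permission]bool)
	}
	ac.userPerms[user][p] = true
}

// GrantGroup gives a permission to every member of a group.
func (ac *AccessControl) GrantGroup(group string, p Permission) {
	if ac.groupPerms[group] == nil {
		ac.groupPerms[group] = make(map[Permission]bool)
	}
	ac.groupPerms[group][p] = true
}

// RevokeUser removes a direct grant; deleting a missing entry is a no-op.
func (ac *AccessControl) RevokeUser(user string, p Permission) {
	delete(ac.userPerms[user], p)
}

// RevokeGroup removes a group grant.
func (ac *AccessControl) RevokeGroup(group string, p Permission) {
	delete(ac.groupPerms[group], p)
}

// Allowed reports whether the user holds the permission directly or through
// one of their groups. Anything not explicitly granted is denied.
func (ac *AccessControl) Allowed(user string, p Permission) bool {
	if ac.userPerms[user][p] {
		return true
	}
	for _, g := range ac.userGroups[user] {
		if ac.groupPerms[g][p] {
			return true
		}
	}
	return false
}

func main() {
	ac := NewAccessControl()
	ac.AddUserToGroup("alice", "sre")
	ac.GrantGroup("sre", "task:read")
	fmt.Println(ac.Allowed("alice", "task:read"), ac.Allowed("bob", "task:read"))
}
```

Denying by default means a missing grant can never expose data, which is the property the requirement asks for.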
Architectural Design
Main Design (Monolithic)
The main architectural design for the system is shown in the diagram below.
The system follows the client-server model: it consists of a server and several agents, each deployed alongside the applications that the system manages. It’s composed of several key components:
- Leura (Server):
  - API: Provides the external interface for accessing services provided by the system.
  - CRUD service: Handles the creation, reading, updating, and deletion of tasks and related data within the system. It also manages the scheduler to support timed tasks and cron tasks.
  - Scheduler: Supports and triggers timed tasks and cron tasks.
  - Execution engine: A high-performance coroutine that manages the lifecycle of tasks and steps. It’s based on an event loop and has five states:
    - Initialisation: Creates tasks and/or related steps, and prepares necessary resources and configurations.
    - Schedule: Adds the steps to the queue.
    - Dispatch: Sends the next step to the agent on target servers.
    - Collect: Gathers the log and execution result from the agents.
    - Finalisation: Starts the next step or completes the whole task according to the execution result.
  - Data storage: Saves necessary persistent data for the system.
- Leura Agent (Client): The actual execution component that executes the dispatched tasks and steps.
Both the server and the agent are monolithic.
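The five states of the execution engine can be read as a small state machine. The Go sketch below illustrates that reading for a single task; it is not the actual engine. The `Agent` interface, the in-process `fakeAgent`, the string-typed steps, and the extra `Done` marker used to end the loop are all assumptions introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// State is one of the five engine states, plus a Done marker that this
// sketch adds to terminate the loop.
type State int

const (
	Initialisation State = iota
	Schedule
	Dispatch
	Collect
	Finalisation
	Done
)

// Agent is a hypothetical stand-in for the Leura Agent.
type Agent interface {
	Run(step string) (log string, err error)
}

// Result holds the collected logs and the error that stopped the task, if any.
type Result struct {
	Logs []string
	Err  error
}

type outcome struct {
	log string
	err error
}

// RunTask drives one task through the states, one transition per iteration.
func RunTask(steps []string, agent Agent) Result {
	var (
		queue   []string
		current string
		pending chan outcome
		last    outcome
		res     Result
	)
	state := Initialisation
	for state != Done {
		switch state {
		case Initialisation:
			// Prepare resources; here that is only an empty queue.
			queue = make([]string, 0, len(steps))
			state = Schedule
		case Schedule:
			queue = append(queue, steps...)
			if len(queue) == 0 {
				state = Done
			} else {
				state = Dispatch
			}
		case Dispatch:
			// Send the next step to the agent without blocking the loop.
			current, queue = queue[0], queue[1:]
			pending = make(chan outcome, 1)
			go func(step string, out chan<- outcome) {
				log, err := agent.Run(step)
				out <- outcome{log: log, err: err}
			}(current, pending)
			state = Collect
		case Collect:
			// Gather the log and execution result from the agent.
			last = <-pending
			res.Logs = append(res.Logs, last.log)
			state = Finalisation
		case Finalisation:
			switch {
			case last.err != nil:
				res.Err = fmt.Errorf("step %q failed: %w", current, last.err)
				state = Done
			case len(queue) == 0:
				state = Done
			default:
				state = Dispatch
			}
		}
	}
	return res
}

// fakeAgent succeeds on every step except the one named in failOn.
type fakeAgent struct{ failOn string }

func (a fakeAgent) Run(step string) (string, error) {
	if step == a.failOn {
		return "error: " + step, errors.New("exit status 1")
	}
	return "ok: " + step, nil
}

func main() {
	res := RunTask([]string{"backup", "deploy", "verify"}, fakeAgent{failOn: "deploy"})
	fmt.Println(res.Logs, res.Err)
}
```

A real engine would interleave many tasks in the same loop and talk to agents over the network; the sketch only shows how each state hands over to the next and how a failed step stops the task in Finalisation.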
Alternative Design (Microservice)
Alternatively, the system can adopt a microservice architecture, improving fault tolerance by splitting the system into separate microservices.
Comparison
Architecture | Performance | Simplicity | Fault tolerance | Maintainability | Sustainability
---|---|---|---|---|---
Monolithic | + | + | - | + | -
Microservice | - | - | + | - | +
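In conclusion, the monolithic design is the better choice for this system: it wins on performance, simplicity, and maintainability, which match the requirements for simple deployment and low resource consumption.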
Suggested Technology Stack
- Programming language: Go
  - The system relies heavily on concurrent programming, so it needs a language with high performance and strong concurrency support.
- Back-end web development library (based on Go): gRPC Gateway
  - gRPC Gateway is the best choice when RESTful APIs and gRPC APIs are both needed, since it generates them from the same .proto file.
- Database: MongoDB
  - The structure of the data produced by customised services may be arbitrary.
  - The system does not require transaction support.
  - Write performance is important since logs generated by steps may be very large.
- Object storage: MinIO
  - MinIO is high-performance and well suited to storing .tar.gz files.
- (Optional) Front-end: React
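To illustrate why concurrency support matters for the language choice, the Go sketch below runs one step on several servers at once while capping the number of simultaneous executions. It is only an illustration under assumptions: `RunOnServers`, the `exec` callback, the `limit` parameter, and `ErrUnreachable` are hypothetical names, and no real agent communication takes place.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrUnreachable is a hypothetical error used by the demo callback.
var ErrUnreachable = errors.New("server unreachable")

// RunOnServers calls exec for every server concurrently, with at most
// limit calls in flight, and returns the error (or nil) for each server.
func RunOnServers(servers []string, limit int, exec func(server string) error) map[string]error {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would block forever
	}
	results := make(map[string]error, len(servers))
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		sem = make(chan struct{}, limit)
	)
	for _, s := range servers {
		wg.Add(1)
		go func(server string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			err := exec(server)
			mu.Lock()
			results[server] = err
			mu.Unlock()
		}(s)
	}
	wg.Wait()
	return results
}

func main() {
	servers := []string{"web-1", "web-2", "db-1"}
	results := RunOnServers(servers, 2, func(server string) error {
		if server == "db-1" {
			return ErrUnreachable
		}
		return nil
	})
	for _, s := range servers {
		fmt.Printf("%s: %v\n", s, results[s])
	}
}
```

The cap on in-flight calls is one way to keep resource usage bounded on small instances, in line with the performance requirements.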
Glossary
- Step: An atomic operation in a typical operation process, like:
  - Execute a script on server(s).
  - Transfer a file to server(s).
  - Call a third-party system.
- Task: An operation process consisting of one or more steps.
- Timed Task: A special task that is scheduled to be triggered at a specific time.
- Cron Task: A special task that runs periodically (similar to Linux Crontab).
- Script: A simple program for completing an operation on a server, usually written in Python or Shell.
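The glossary terms map naturally onto a few Go types. The sketch below is one possible data model, offered as an assumption rather than Leura's real schema: the `Step` interface with its `Describe` method, the field names, and the `Plan` and `Due` helpers are hypothetical, and the cron expression is stored without being parsed.

```go
package main

import (
	"fmt"
	"time"
)

// Step is an atomic operation. Describe is a hypothetical method used
// here only to print what a step would do.
type Step interface {
	Describe() string
}

// ScriptStep executes a script on one or more servers.
type ScriptStep struct {
	Script  string
	Servers []string
}

func (s ScriptStep) Describe() string {
	return fmt.Sprintf("run script %q on %d server(s)", s.Script, len(s.Servers))
}

// FileTransferStep transfers a file to one or more servers.
type FileTransferStep struct {
	Source      string
	Destination string
	Servers     []string
}

func (s FileTransferStep) Describe() string {
	return fmt.Sprintf("transfer %q to %q on %d server(s)", s.Source, s.Destination, len(s.Servers))
}

// Task is an operation process consisting of one or more steps.
type Task struct {
	Name  string
	Steps []Step
}

// Plan lists the steps in execution order.
func (t Task) Plan() []string {
	plan := make([]string, 0, len(t.Steps))
	for _, s := range t.Steps {
		plan = append(plan, s.Describe())
	}
	return plan
}

// TimedTask is a task triggered once at a specific time.
type TimedTask struct {
	Task
	At time.Time
}

// Due reports whether the trigger time has been reached.
func (t TimedTask) Due(now time.Time) bool {
	return !now.Before(t.At)
}

// CronTask is a task that runs periodically. Spec holds a crontab-style
// expression that this sketch stores but does not parse.
type CronTask struct {
	Task
	Spec string
}

func main() {
	deploy := Task{
		Name: "deploy",
		Steps: []Step{
			FileTransferStep{Source: "app.tar.gz", Destination: "/opt/app/", Servers: []string{"web-1", "web-2"}},
			ScriptStep{Script: "restart.sh", Servers: []string{"web-1", "web-2"}},
		},
	}
	for i, line := range deploy.Plan() {
		fmt.Printf("%d. %s\n", i+1, line)
	}
	nightly := CronTask{Task: deploy, Spec: "0 3 * * *"}
	once := TimedTask{Task: deploy, At: time.Now().Add(time.Hour)}
	fmt.Println(nightly.Spec, once.Due(time.Now()))
}
```

Embedding `Task` in `TimedTask` and `CronTask` reflects the glossary's wording that both are special kinds of task.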