Context
A system can become partially or fully inoperable for a number of reasons. In that situation, the whole system is usually taken down until the underlying problems have been resolved. Depending on the cause, however, the system may still be able to operate partially.
With the System Mode of Operation, we can place selective restrictions on operations and avoid taking down the whole system.
Some use cases for Mode of Operation include the following:
- Updating one or more internal services
- Database migration
- System Deployment
- Unavailable external dependency
- Debugging an unknown internal system failure in production
- Partial infrastructure upgrade
Requirements
Mode of operation is a system-level state that dictates whether an operation can be performed. Services should remain resilient when part of the system is non-functional. For example:
- An order cannot be created if the payment processing exchange is offline.
- If there is a problem with storage (e.g., database, blob storage), the system should be in read-only mode.
- In case of a critical failure, no operation should run on top of faulty or corrupted data.
Design Considerations
- The system state should be cached so that bursts of requests do not trigger rate limiting, because:
  - An external service provider's API might enforce a rate limit based on identity token or IP address.
  - A status check might be needed in multiple operations of a single task flow (e.g., order processing is a task flow).
  - Multiple task flows will be executed in parallel.
- A REST API endpoint is needed so that other services connecting to the system can check the state before placing an order, instead of receiving an error after placing it.
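The caching consideration could be sketched as a small time-to-live cache in front of a status check. This is only an illustration in Python: `CachedStatus`, `fetch`, and the injectable `clock` are hypothetical names, and the 300-second value simply mirrors a 5-minute refresh interval.

```python
import time
from typing import Callable, Optional


class CachedStatus:
    """Caches the result of a status fetch for a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # hypothetical call to the real status source
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the cache can be tested
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached status, refetching only after the TTL expires."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


if __name__ == "__main__":
    # Assumed stand-in for a rate-limited external status call.
    status = CachedStatus(fetch=lambda: "Normal", ttl_seconds=300)
    print(status.get())  # first call fetches
    print(status.get())  # second call is served from the cache
```

However many task flows call `get()` within the time-to-live, the underlying source is queried once. A cache shared between services would need an external store rather than process memory; the in-process version only shows the idea.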
Implementation
- A Time Trigger function that will:
  - Fetch status from dependencies every 5 minutes (exchange, currency conversion rate provider, external system, etc.).
  - Determine the current state (placing/canceling orders, etc.).
  - If the state has changed, store the new state in the database and the Redis cache.
- An HTTP Trigger function that returns the current system states.
- The modes of operation are:
Mode | Description |
---|---|
Normal | All operations will be executed |
No-New-Order | Can execute all operations but no new buy/sell order can be placed |
Read-Only | Can execute view operations but no update/delete can be done |
Unavailable | No operations can be executed on the system |
- The source of the configuration change will also be stored along with it. A configuration change can come from two sources:
  - User
    - The change was made by an administrator of the System.
    - Supersedes all other change sources, meaning this configuration will not be overridden by any other source.
  - SystemProcess
    - Logic running within the System has determined that the configuration should be changed.
    - A SystemProcess change overrides the existing configuration only if that configuration was itself set by SystemProcess.
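The precedence between the two change sources could be sketched as follows. The type and function names (`ModeConfig`, `apply_change`) are hypothetical and chosen only for illustration; the rule itself follows the description of User and SystemProcess above.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NORMAL = "Normal"
    NO_NEW_ORDER = "No-New-Order"
    READ_ONLY = "Read-Only"
    UNAVAILABLE = "Unavailable"


class Source(Enum):
    USER = "User"
    SYSTEM_PROCESS = "SystemProcess"


@dataclass(frozen=True)
class ModeConfig:
    """A mode of operation together with the source that set it."""
    mode: Mode
    source: Source


def apply_change(current: ModeConfig, proposed: ModeConfig) -> ModeConfig:
    """Return the configuration in effect after a proposed change."""
    if proposed.source is Source.USER:
        # A User change supersedes every other source.
        return proposed
    if current.source is Source.SYSTEM_PROCESS:
        # SystemProcess may only replace what SystemProcess set earlier.
        return proposed
    # The current configuration was set by a User, so it stays in place.
    return current


if __name__ == "__main__":
    by_user = ModeConfig(Mode.READ_ONLY, Source.USER)
    by_system = ModeConfig(Mode.NORMAL, Source.SYSTEM_PROCESS)
    print(apply_change(by_user, by_system).mode.value)
```

With a User-set Read-Only configuration in place, a SystemProcess proposal of Normal is ignored and the mode stays Read-Only.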
Mode Determination
- If the source of the system mode is User, the system operates in the mode that was set.
- If the source of the system mode is SystemProcess, the system runs predetermined logical operations and updates the Mode of Operation accordingly. These operations can include the following:
  - Status check for external system dependencies.
  - Status check for infrastructure resources.
  - System maintenance configuration.
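A minimal sketch of how SystemProcess might turn these checks into a mode is shown below. The mapping is an assumption drawn from the examples under Requirements (exchange offline blocks new orders, storage problems force read-only, a critical failure stops everything). The `Checks` fields and the idea that a maintenance configuration names a mode directly are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    NORMAL = "Normal"
    NO_NEW_ORDER = "No-New-Order"
    READ_ONLY = "Read-Only"
    UNAVAILABLE = "Unavailable"


# Modes ordered from least to most restrictive.
_RESTRICTIVENESS = [Mode.NORMAL, Mode.NO_NEW_ORDER, Mode.READ_ONLY, Mode.UNAVAILABLE]


@dataclass(frozen=True)
class Checks:
    """Hypothetical results of the status checks."""
    exchange_online: bool                    # external system dependency
    storage_healthy: bool                    # infrastructure resources
    critical_failure: bool = False
    maintenance_mode: Optional[Mode] = None  # assumed: maintenance names a mode


def determine_mode(checks: Checks) -> Mode:
    """Pick the most restrictive mode required by any failed check."""
    candidates = [Mode.NORMAL]
    if not checks.exchange_online:
        candidates.append(Mode.NO_NEW_ORDER)
    if not checks.storage_healthy:
        candidates.append(Mode.READ_ONLY)
    if checks.critical_failure:
        candidates.append(Mode.UNAVAILABLE)
    if checks.maintenance_mode is not None:
        candidates.append(checks.maintenance_mode)
    return max(candidates, key=_RESTRICTIVENESS.index)


if __name__ == "__main__":
    result = determine_mode(Checks(exchange_online=False, storage_healthy=True))
    print(result.value)
```

Choosing the most restrictive candidate means that two simultaneous problems never result in a looser mode than either would require on its own.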
Matrix
A state is valid if it is not null, empty, or white-space.
Operation | No valid state | Normal | No-New-Order | Read-Only | Unavailable |
---|---|---|---|---|---|
Operation 1 | ❌ | ✅ | ❌ | ❌ | ❌ |
Operation 2 | ❌ | ✅ | ❌ | ❌ | ❌ |
Operation 3 | ❌ | ✅ | ✅ | ✅ | ❌ |
Operation 4 | ✅ | ✅ | ✅ | ✅ | ❌ |
- If no valid state is stored in the system, Operations 1, 2, and 3 cannot be performed; only Operation 4 can be executed. Some examples of Operation 4 are Health Check, Metrics, Administration, etc.
- All operations can run if the mode is Normal.
- Operations 1 and 2 cannot run if the mode is No-New-Order or Read-Only, but Operations 3 and 4 can.
  - Operations 1 and 2 can be, for example, creating and updating orders.
  - Operations 3 and 4 can be, for example, getting order information or status, conversion rates, or other dependency information for validity.
- No operation can run if the mode is Unavailable.
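The matrix can be read as a lookup table. The sketch below encodes it that way; the operation names are the placeholders used in the matrix, `is_allowed` and `parse_mode` are hypothetical helpers, and treating an unrecognized state string as invalid is an assumption that goes beyond the stated validity rule.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    NORMAL = "Normal"
    NO_NEW_ORDER = "No-New-Order"
    READ_ONLY = "Read-Only"
    UNAVAILABLE = "Unavailable"


# One entry per matrix row: the modes in which the operation may run.
ALLOWED = {
    "Operation 1": {Mode.NORMAL},
    "Operation 2": {Mode.NORMAL},
    "Operation 3": {Mode.NORMAL, Mode.NO_NEW_ORDER, Mode.READ_ONLY},
    "Operation 4": {Mode.NORMAL, Mode.NO_NEW_ORDER, Mode.READ_ONLY},
}

# Operations that may still run when no valid state is stored.
ALLOWED_WITHOUT_VALID_STATE = {"Operation 4"}


def parse_mode(state: Optional[str]) -> Optional[Mode]:
    """Return the Mode for a stored state, or None if the state is not valid."""
    if state is None or not state.strip():
        return None  # null, empty, or white-space
    try:
        return Mode(state.strip())
    except ValueError:
        return None  # assumption: unknown values are treated as invalid


def is_allowed(operation: str, state: Optional[str]) -> bool:
    """Check a single operation against the stored state."""
    mode = parse_mode(state)
    if mode is None:
        return operation in ALLOWED_WITHOUT_VALID_STATE
    return mode in ALLOWED.get(operation, set())


if __name__ == "__main__":
    print(is_allowed("Operation 3", "Read-Only"))
    print(is_allowed("Operation 1", "No-New-Order"))
```

Keeping the rules in one table means that adding an operation or a mode changes data rather than control flow, and an operation missing from the table is denied by default.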
Flow Diagram