An error budget covers:
- releasing new features
- expected system changes
- inevitable failures in hardware, networks, etc.
- planned downtime
- risky experiments

Error budgets also:
- share responsibility for reliability between Ops and Dev teams
- reduce feature iteration speed when systems are unreliable
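The idea above can be made concrete with a little arithmetic; the 99.9% target and traffic volume below are illustrative numbers, not from the source:

```python
# Worked example: turning an SLO target into an error budget.
slo_target = 0.999             # 99.9% availability objective (illustrative)
monthly_requests = 10_000_000  # expected requests in the compliance window

error_budget_fraction = 1 - slo_target             # 0.1% of requests may fail
budget_in_requests = monthly_requests * error_budget_fraction

# Spending the budget: every bad release, hardware outage, planned
# downtime, or risky experiment consumes part of these ~10,000 requests.
# When the budget is exhausted, feature releases slow down.
```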
Availability SLI
The proportion of valid requests served successfully.
One commonly used signifier of success or failure is the status code of an HTTP or RPC response. This requires careful, accurate use of status codes within your system so that each code maps distinctly to either success or failure.
A reasonable strategy here is to write that complex logic as code and export a boolean availability measure to your SLO monitoring systems, for use in a bad-minute style SLI like the example above.
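A minimal sketch of that strategy, assuming a policy where 5xx responses count as failures and 4xx responses count as valid-but-successful client errors; the names (`is_success`, `record_response`) and the policy itself are illustrative, not from the source:

```python
# "Complex logic as code" sketch: classify each response's status code
# and export the result as two simple counters that an SLO monitoring
# system can consume as "good events" and "total events".

GOOD = 0   # good events counter
TOTAL = 0  # total events counter

def is_success(status_code: int) -> bool:
    # Policy decision (assumed here): 5xx counts against availability,
    # 4xx is treated as a valid request the client got wrong.
    return status_code < 500

def record_response(status_code: int) -> None:
    global GOOD, TOTAL
    TOTAL += 1
    if is_success(status_code):
        GOOD += 1

for code in (200, 200, 404, 503, 200):
    record_response(code)

availability = GOOD / TOTAL  # 4 good events out of 5 total
```

Each team must decide its own status-code mapping; the point is that the decision lives in one place in code, and only the two counters are exported.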
Measuring SLIs:
Application-level Metrics

Pros:
- Often fast and cheap (in terms of engineering time) to add new metrics.
- Complex logic to derive an SLI can be turned into code and exported as two much simpler counters: "good events" and "total events".
Logs Processing

Processing server-side logs of requests or data to generate SLI metrics.

Pros:
- Existing request logs can be processed retroactively to backfill SLI metrics.
- Complex user journeys can be reconstructed using session identifiers.
- Complex logic to derive an SLI can be turned into code and exported as two much simpler counters: "good events" and "total events".

Cons:
- Application logs do not contain requests that never reached the servers.
- Processing latency makes logs-based SLIs unsuitable for triggering an operational response.
- Engineering effort is needed to generate SLIs from logs; session reconstruction can be time-consuming.
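A small sketch of the retroactive backfill idea: replay server-side request logs and bucket them into per-minute good/total counts. The log format (timestamp plus status code) and the 5xx-means-failure rule are assumptions for illustration:

```python
# Backfilling an availability SLI from server-side request logs.
from collections import defaultdict

log_lines = [  # assumed format: "<ISO timestamp> <status code>"
    "2024-01-01T00:00:01Z 200",
    "2024-01-01T00:00:02Z 500",
    "2024-01-01T00:01:00Z 200",
]

good = defaultdict(int)
total = defaultdict(int)

for line in log_lines:
    ts_raw, status = line.split()
    # Bucket events by minute so the result can feed a bad-minute style SLI.
    minute = ts_raw[:16]  # e.g. "2024-01-01T00:00"
    total[minute] += 1
    if int(status) < 500:
        good[minute] += 1

sli = {m: good[m] / total[m] for m in total}
```

Because this reads historical logs rather than live traffic, it can fill in SLI data from before the SLI existed; the same latency is why it is unsuitable for paging.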
Front-end Infrastructure Metrics

Pros:
- Metrics and recent historical data most likely already exist, so this option probably requires the least engineering effort to get started.
- Measures SLIs at the point closest to the user that is still within the serving infrastructure.

Cons:
- Not viable for data-processing SLIs or, in fact, any SLIs with complex requirements.
- Only measures the approximate performance of multi-request user journeys.
Probers

Pros:
- Approximates the user experience with synthetic requests.

Cons:
- Covering all corner cases is hard and can devolve into integration testing.
- High reliability targets require frequent probing for accurate measurement.
- Probe traffic can drown out real traffic.
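A minimal prober sketch: issue a synthetic request and record whether the endpoint answered successfully within a timeout. The URL and timeout are illustrative; a real prober would run on a schedule and rate-limit itself so probe traffic stays negligible next to real traffic:

```python
# Minimal synthetic prober using only the standard library.
import urllib.request

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Timeouts, connection errors, and non-2xx raises all count as failure.
        return False
```

Each probe result becomes one event in the good/total counters, which is why high reliability targets need frequent probing: with few probes per window, a single failed probe swings the measured SLI dramatically.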
SLO & SLI
Source: https://www.cnblogs.com/anyu686/p/13493016.html