Services that CSE develops use logging and monitoring to understand service performance, gain insight into complex workflows, and debug service failures.
- Collect and evaluate key performance metrics on a service and across multiple services
- Trace complex requests and workflows within and across multiple services
- Diagnose and debug errors and service outages
- Provide the required information to understand the state of the running system
- Application faults and errors are logged
- Additional log information (warning, debug, trace) is collected depending on configuration
- Logging configuration can be modified without code changes (e.g. increasing verbosity when troubleshooting problems)
- Metrics are collected to identify service health
- In microservice architectures, a correlation ID is used to follow the flow of an operation across the participating services
- GDPR compliance is ensured regarding personally identifiable information
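As a minimal sketch of changing log verbosity without code changes, the log level can be read from the environment at startup. The variable name `LOG_LEVEL`, the logger name `orders`, and the helper `configure_logging` are assumptions made for illustration; any configuration source (file, feature flag, app settings) would serve the same purpose.

```python
import logging
import os


def configure_logging(default_level: str = "INFO") -> logging.Logger:
    """Build a logger whose verbosity comes from the (hypothetical) LOG_LEVEL variable."""
    level_name = os.environ.get("LOG_LEVEL", default_level).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        # Unknown or misspelled level names fall back to INFO.
        level = logging.INFO

    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(level)
    if not logger.handlers:
        # Attach a handler only once so repeated calls do not duplicate output.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = configure_logging()
    log.debug("Only visible when LOG_LEVEL=DEBUG")
    log.error("Application faults are always logged")
```

Running the service with `LOG_LEVEL=DEBUG` then raises verbosity for a troubleshooting session without redeploying code.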
Logging should be addressed from the beginning of software development. We should avoid a scenario where production has issues but our application does not provide enough information to help us identify the problem.
A good exercise during software development is to ask ourselves: what can go wrong in this operation or action? Once problem areas have been identified, add log information. A few guidelines:
- When interacting with an external system, we should know when a call fails. On a more detailed level, we might want to log all calls made to the external system. Doing so allows us to answer questions such as "Are you calling our system? When? What were the inputs and outputs?".
- Be able to identify which configuration settings our application is running with. For instance, "using API located at http://xyz" or "setting min io threads to 5". This information helps identify why reading files asynchronously is taking longer than expected.
- Log when important events occur, e.g. when an order is created. Not finding logs for "order created" in the last 30 minutes might indicate a problem.
- Add more information about errors. For instance, when a SQL call fails, by default we only know the low-level reason (stored procedure failed). By catching the error and logging "Failed to update basket", we also know in which scenario the stored procedure was called.
- Use existing logging frameworks and store logs in separate storage
- Ensure log messages are machine readable/parsable and follow a common schema across all services
- Add contextual information and use log severity levels
- Use log severity levels consistently across services
- Leverage quality of service to continuously validate application availability and responsiveness
- Use metrics to get an overview of system performance and health (request times, request counts, queue length, etc.)
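Several of the practices above (machine-readable messages, a common schema, contextual information, severity levels, and adding scenario context to errors) can be sketched with the standard `logging` module. The field names in the JSON schema, the `JsonFormatter` class, and the `update_basket` helper are illustrative assumptions, not a prescribed format.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object using an assumed common schema."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Context passed through `extra=`; None when the caller gave none.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        if record.exc_info:
            # Keep the low-level reason alongside the scenario-level message.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def update_basket(logger: logging.Logger, correlation_id: str, run_query) -> bool:
    """Run a (hypothetical) data-access call and log the failing scenario."""
    try:
        run_query()
        return True
    except Exception:
        logger.exception(
            "Failed to update basket",
            extra={"correlation_id": correlation_id},
        )
        return False


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("basket-service")  # hypothetical service name
    log.addHandler(handler)

    def failing_query():
        raise RuntimeError("stored procedure failed")

    update_basket(log, "req-123", failing_query)
```

Because every service emits the same fields, a log store can filter by `correlation_id` to follow one operation across services.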
The General Data Protection Regulation (GDPR) requires paying attention to log content. Email addresses, usernames, phone numbers, and IP addresses are among the types of information considered personal, and strict rules apply to their storage. It is important to pay attention to what is logged. One technique for handling this is anonymizing sensitive data (for instance, replacing the last digits of an IP address). Moreover, pay attention to log transmission and data at rest, ensuring that encryption is in place.
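The IP address technique mentioned above can be sketched with Python's standard `ipaddress` module. The helper name `anonymize_ip` and the chosen prefix lengths (keeping the first 24 bits of IPv4 and the first 48 bits of IPv6) are assumptions for illustration; whether this level of masking satisfies a given compliance requirement needs to be confirmed separately.

```python
import ipaddress


def anonymize_ip(raw: str) -> str:
    """Zero the trailing bits of an IP address before it is written to a log."""
    addr = ipaddress.ip_address(raw)
    # Assumed prefix lengths: keep /24 for IPv4 and /48 for IPv6.
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    print(anonymize_ip("203.0.113.42"))         # 203.0.113.0
    print(anonymize_ip("2001:db8:abcd:12::1"))  # 2001:db8:abcd::
```

Applying the helper at the point where the log entry is built keeps the raw address out of both transmission and storage.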