In asynchronous applications, a request UUID must be added to every log entry, making it easy to trace the entire chain of logs for a particular call.
If the logs are written clearly and you can easily determine what exactly the error is, then the work has been done well. Experienced developers know that it is better not to skimp on info-level logs: if there are not enough of them, you will need to add logging code, deploy to production again, and try to reproduce the problem, which takes quite a lot of precious time.
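In Python, one way to propagate such a request UUID through an asynchronous call chain is a `contextvars` variable combined with a `logging.Filter`. This is a minimal sketch, not the article's specific setup; names like `request_id_var` and `handle_request` are illustrative:

```python
import contextvars
import logging
import uuid

# Holds the current request's UUID; contextvars work correctly with
# asyncio tasks, unlike a plain global or a thread-local.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Injects the current request UUID into every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.addFilter(RequestIdFilter())
logger.setLevel(logging.INFO)

def handle_request():
    # Set once at the start of each request (e.g. in middleware); every
    # log line in the call chain then carries the same id.
    request_id_var.set(str(uuid.uuid4()))
    logger.info("request started")
    logger.info("request finished")

handle_request()
```

Grepping the logs for one UUID then yields the full history of a single call.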
For error-level logs, I recommend an additional tool: Sentry. It is convenient for viewing new and most frequent errors. For exceptions, you can save a traceback, which greatly speeds up fixing them. It is also convenient to set up notifications about all new errors in a Telegram channel.
For collecting application metrics, Prometheus is a good choice, with Grafana to visualize them.
Remember that the development and maintenance team must learn about all abnormal situations before user complaints arrive. At the very first rollout of an application it may be enough to collect metrics on 4xx and 5xx errors, API method latencies, and a couple of important business indicators; but as the code base and the importance of the service grow, you need to keep adding more specific, narrowly focused metrics.
Step 1. Code review
If you take this step seriously, there is a chance to catch a huge number of errors and avoid blushing in front of users. During code review it is not necessary to act as an interpreter and meticulously check every line; what matters is that the reviewer understands the general idea of each function and, in case of any misunderstanding, does not hesitate to contact the author of the code. Also make sure that a unit test is written for each critical piece of code, and check that the values the test asserts are correct. Total test coverage should reach 80% (coverage can be checked with third-party tools such as SonarQube).
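As an illustration of covering a critical piece of code with a unit test, here is a sketch using the standard `unittest` module; `apply_discount` is a hypothetical business function, not taken from the article:

```python
import unittest

def apply_discount(price: float, percent: float) -> float:
    """Critical business logic: apply a percentage discount to a price."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

class ApplyDiscountTest(unittest.TestCase):
    def test_regular_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_discount(self):
        self.assertEqual(apply_discount(99.99, 0), 99.99)

    def test_invalid_percent_rejected(self):
        # Check boundary conditions, not just the happy path.
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)

# Run the suite programmatically (equivalent to `python -m unittest`).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
result = unittest.TextTestRunner().run(suite)
```

During review it is exactly the asserted values (150.0, the ValueError boundary) that are worth double-checking, not just the fact that a test exists.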
Also, during code review it is important to check database migrations: make sure that adding columns or creating an index does not “hang” the database. And if there is such a risk, move these steps into the pre-release stage and execute them under minimal server load.
Particular attention should be paid to SQL queries: search, update, and delete operations must always contain a WHERE clause (if all parameters are passed to the function as null, the code should fail fast rather than send a query to the database without any conditions).
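A minimal sketch of such a guard, assuming a helper that builds parameterized queries (the function name and signature are hypothetical, not from the article):

```python
def build_update_query(table: str, values: dict, filters: dict) -> tuple:
    """Build a parameterized UPDATE and refuse to run without a WHERE clause.

    If every filter value is None, fail fast instead of silently
    updating the whole table.
    """
    filters = {k: v for k, v in filters.items() if v is not None}
    if not filters:
        raise ValueError("refusing to build an UPDATE without WHERE conditions")
    set_clause = ", ".join(f"{col} = ?" for col in values)
    where_clause = " AND ".join(f"{col} = ?" for col in filters)
    sql = f"UPDATE {table} SET {set_clause} WHERE {where_clause}"
    params = tuple(values.values()) + tuple(filters.values())
    return sql, params

sql, params = build_update_query("users", {"name": "Bob"}, {"id": 42})
print(sql, params)
```

With `filters={"id": None}` the call raises instead of producing an unconditional UPDATE.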
Also, for all suspicious queries it is worth running EXPLAIN (ideally against a production database replica) and making sure that the query cost is low and indexes are used. Otherwise, add the missing indexes to the migrations.
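In PostgreSQL this is done with EXPLAIN / EXPLAIN ANALYZE. As a self-contained illustration of the same idea, here is a sketch using the stdlib `sqlite3` and its `EXPLAIN QUERY PLAN` (the table and index are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")

# Ask the planner how it would execute the query without running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = ?", (42,)
).fetchall()
# The plan detail should mention the index rather than a full table scan.
print(plan)
```

If the plan shows a scan instead of an index search, the missing index belongs in a migration.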
Step 2. Pentest
To prevent the data your users store from becoming available to third parties, you need to conduct an application pentest (penetration test): a method of evaluating the security of computer systems or networks by simulating an attack. At Domclick, pentests are carried out by cybersecurity experts. If your company does not have engineers trained in vulnerability scanning, then I recommend taking at least a basic cybersecurity course for developers in order to avoid the most basic mistakes.
In my experience, the following set of actions will greatly reduce the risk of an attacker getting unprotected data:
Use up-to-date versions of libraries (older versions may contain known vulnerabilities), and prefer the libraries most popular in the community.
When working with a database, do not use raw SQL with string concatenation (or minimize its use). Make sure that the driver used to connect to the database escapes dangerous special characters in query parameters (most SQL injections are based on adding special characters to a query so that a command the attacker needs gets executed).
When storing text that will later be shown to users (for example, comments), escape it into safe HTML so that the page does not execute embedded JS code.
For each API method, use role-based access control. Log which user performed each add, change, or delete operation.
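The two database- and HTML-related points above can be sketched with the stdlib `sqlite3` and `html` modules (the table and the malicious input are illustrative, not from the article):

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")

# Malicious input: an injection attempt plus embedded JS.
user_input = "'); DROP TABLE comments; --<script>alert(1)</script>"

# 1. Parameterized query: the driver treats the whole string as data,
#    so the quotes and semicolons never become part of the SQL statement.
conn.execute("INSERT INTO comments (body) VALUES (?)", (user_input,))

# 2. Escape before rendering, so the browser shows the text instead of
#    executing the script tag.
stored = conn.execute("SELECT body FROM comments").fetchone()[0]
safe_html = html.escape(stored)
print(safe_html)  # the <script> tag is now inert text
```

The table survives the “injection”, and the escaped text can be embedded in a page safely.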
I also note that after fixing the vulnerabilities identified during the pentest, the code must go through review again.
Step 3. Load testing
To understand whether the service can withstand an influx of users, you need to conduct load testing. This step will also help you understand how many requests a single instance (pod) can withstand, and correctly calculate the headroom needed for peak loads.
There are a lot of tools for load testing; as a starting point you can pick Apache JMeter or the wrk utility (for local tests). To simulate a complex load profile, you can write a script with the necessary API calls yourself and run it in the required number of threads.
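A minimal sketch of such a script using only the standard library; `call_api` here is a stub that simulates one request and should be replaced with a real HTTP call to the service under test:

```python
import concurrent.futures
import statistics
import time

def call_api() -> float:
    """Stub for one API call; replace the body with a real HTTP request
    against the service under test. Returns the observed latency."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated server response time
    return time.perf_counter() - start

def run_load(threads: int, requests_per_thread: int) -> list:
    """Fire the calls from a thread pool and collect latencies."""
    latencies = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(call_api)
                   for _ in range(threads * requests_per_thread)]
        for f in concurrent.futures.as_completed(futures):
            latencies.append(f.result())
    return latencies

latencies = run_load(threads=8, requests_per_thread=5)
print(f"requests: {len(latencies)}, "
      f"p50: {statistics.median(latencies) * 1000:.1f} ms, "
      f"max: {max(latencies) * 1000:.1f} ms")
```

Increasing `threads` until latencies or errors degrade gives a rough per-instance capacity figure.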
Step 4. Canary Release
Unfortunately, it is impossible to foresee every error when releasing new functionality, but their impact can be minimized by conducting a canary release: a risk-reduction method in which the new version is first rolled out to a small subset of users, with the share gradually increased until the change is available to everyone. At each step of shifting traffic to the release version, monitor errors and be ready to roll back to the old version at any moment (it is especially important to have a worked-out plan for rolling back database migrations in case of an emergency).
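One common way to implement the traffic split (an assumption for illustration, not the article's specific mechanism) is deterministic bucketing by user id:

```python
import hashlib

def use_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically route a stable subset of users to the canary.

    Hashing the user id (instead of a random choice per request) keeps
    each user on the same version across requests, which makes errors
    easier to attribute to a specific version.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]  # 0..65535
    return bucket % 100 < canary_percent

# Gradually raise the percentage while watching error metrics:
for percent in (1, 5, 25, 100):
    share = sum(use_canary(str(uid), percent) for uid in range(10_000)) / 10_000
    print(f"canary {percent}%: {share:.1%} of users routed")
```

At 100% the canary becomes the release version; at any earlier step a rollback only affects the routed fraction of users.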