Profiling web endpoints for fun and profit

Overview

At VXSoft we treat performance as a feature and take performance issues very seriously. Our products are mainly web applications, so they are affected by all the performance problems (e.g. traffic spikes, poorly written code, database performance) and security issues (the OWASP Top Ten) that any web application living in the wild faces. Our applications have different deployment types, varying from a single cloud to on-site isolated installations.

The isolated installations are hard to debug and monitor, hence the difficulty of detecting the reasons behind occasional cases of performance degradation. We needed a solution, preferably a reusable one, to detect the problematic web endpoints causing this degradation, after which the engineering team would be able to narrow down the spectrum of possible investigation subjects.

Logs

Some installations of our systems are isolated from the public network and are inaccessible even to our customers' support teams, which was the main reason real-time web application monitoring tools wouldn't suit us. So the decision our team ended up with was to collect logs on those installations and process them later on our side. For that goal, we defined the following steps:
Fig. 1. Log collection and processing flow

Log format

For each request/endpoint the following information is logged into a single log file:

{timestamp in ISO format};{HTTP method};{URI};{elapsed}
Fig. 2. Defined log format

The defined log format contains only the minimum information necessary for identifying slow endpoints, while also allowing us to collect interesting insights about application usage, the ratio of HTTP methods, etc. An important aspect is that the log doesn't contain any personal or confidential information of our clients.

After enabling logging in the application's configuration file, the system starts collecting timing logs for all endpoints, resulting in a CSV log file like the following:

2021-10-01T12:06:41.009819;GET;/?;0.0012438558042049408
2021-10-01T12:06:41.043376;GET;/am/dashboard/tab/1?;0.057040732353925705
2021-10-01T12:06:41.018917;GET;/am/widgets/5/render?task_id=1306383&incoming_id=0&outgoing_id=0&internal_id=1306383&tab_id=263;0.04639996960759163
...
Fig. 3. Log sample

Log collector

Our web applications are Python WSGI applications, which use uWSGI as an application server and stand behind an NGINX reverse proxy. We decided to collect the defined metrics only for the web application itself, skipping the web server and application server. This let us estimate the performance of our own code while ignoring network latency and other third-party dependencies. It was also mandatory for the log collector to be web-framework agnostic, as we use different frameworks across our applications. For example, here is the integration of the perflog log collector with the wheezy.web framework:

application = WSGIApplication(
    [
        bootstrap_defaults(url_mapping=URLS),
        http_error_middleware_factory,
        path_routing_middleware_factory
    ],
    options=config.options
)

# enable performance logging if configuration is set
if config.config.getboolean('logging', 'perf-log', fallback=False):
    application = perflog(application, 'var/log/perflog.csv')
Fig. 4. Enabling performance logging for Wheezy.Web

As can be seen in fig. 4, the log collector simply wraps the existing WSGI application, which makes it universal across compliant frameworks. Making it configurable allows the application's owner to control the log collection duration.

import atexit
from time import perf_counter
from datetime import datetime


def perflog(app: callable, logfile: str) -> callable:
    # Keep a single append-mode file descriptor open for the lifetime
    # of the process and make sure it is closed on interpreter exit.
    fd = open(logfile, 'a')
    atexit.register(fd.close)

    def wsgi_application(env: dict, start_response: callable):
        # Measure the wall-clock time spent inside the wrapped application.
        elapsed = perf_counter()
        result = app(env, start_response)
        elapsed = perf_counter() - elapsed
        # Append one CSV record: timestamp;method;uri;elapsed.
        fd.write(
            ''.join((
                datetime.now().isoformat(),
                ';',
                env['REQUEST_METHOD'],
                ';',
                f"{env['PATH_INFO']}?{env['QUERY_STRING']}",
                ';',
                str(elapsed),
                '\n',
            ))
        )
        fd.flush()
        return result
    return wsgi_application
Fig. 5. WSGI application wrapper for performance log collection

Fig. 5 presents the source code of the WSGI application wrapper, which collects logs in the defined format for each request. Its own overhead is negligible, and it is very simple to maintain.
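For completeness, here is a minimal, self-contained sketch of the wrapper in use, serving a trivial WSGI application with the standard library's wsgiref server (the hello_app application and the port are made up purely for illustration):

from wsgiref.simple_server import make_server


def hello_app(env: dict, start_response: callable):
    # A trivial WSGI application, used only to demonstrate the wrapper.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


# Wrap the application exactly as in fig. 4; every request is now timed
# and appended to the CSV log file.
application = perflog(hello_app, 'perflog.csv')

with make_server('', 8000, application) as server:
    server.serve_forever()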

Log processing

After collecting logs for some time and receiving them from clients (fig. 1), the next step is processing them. As the defined log format is custom, we decided to build a small utility to parse and visualize the data, and we went with a Jupyter notebook for its quick prototyping and its support for powerful visualization tools. It is also possible to export the composed notebook as an HTML file, which provides decent reporting functionality.

Fig. 6. Performance logging report example

In the notebook (fig. 6), pandas is used for data processing in combination with Plotly for displaying interactive, beautiful charts.
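As a rough illustration of the kind of processing the notebook performs, here is a sketch; the column names are our own choice, since the log file itself has no header row:

import pandas as pd
import plotly.express as px

# The log has no header row, so we name the columns ourselves.
df = pd.read_csv(
    'perflog.csv',
    sep=';',
    names=['timestamp', 'method', 'uri', 'elapsed'],
    parse_dates=['timestamp'],
)

# Strip query strings so that different calls to the same endpoint
# are aggregated together.
df['endpoint'] = df['uri'].str.split('?').str[0]

# Top 10 slowest endpoints by mean response time (cf. fig. 7a).
slowest = (
    df.groupby('endpoint')['elapsed']
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
print(slowest)

# Distribution of HTTP methods (cf. fig. 7b).
counts = df['method'].value_counts().reset_index()
counts.columns = ['method', 'requests']
px.pie(counts, names='method', values='requests',
       title='Distribution of HTTP methods').show()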

Fig. 7. Performance report analysis: a) top slowest requests; b) distribution of HTTP methods

By using our logging and processing technique, we have been able to identify the top slowest requests (fig. 7a) over time, the distribution of HTTP methods (fig. 7b), and other interesting insights. By isolating slow endpoints, we can then dig further in that direction to find the true causes of slow performance.
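For instance, once a suspicious endpoint has been identified, the same dataframe can be used to inspect its behavior over time. A sketch, reusing the df from above (the endpoint path is taken from the sample log in fig. 3):

# Mean response time of a single endpoint, resampled per hour, to see
# whether the slowness is constant or load-dependent.
endpoint = df[df['endpoint'] == '/am/dashboard/tab/1']
hourly = endpoint.set_index('timestamp')['elapsed'].resample('1H').mean()
print(hourly)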

Conclusion

Performance is a critical and integral part of any software product. During an application's lifetime its performance can both worsen and improve, depending on new features, proper monitoring, etc. But identifying the true origin of a performance degradation is a very hard problem in itself.

Due to the specifics of our clients and of the applications we develop and support, in many cases we don't have access to the deployment, which makes it very hard to monitor or collect any runtime information. That's why we developed our own solution for collecting and analyzing runtime measurements of web application endpoints. We were able to use that information, and the reports based on it, to isolate the endpoints causing performance degradation for further investigation.
