The Premise

We developed a high-availability, high-traffic set of web services for a specific client. The endpoint methods are entirely stateless by design, which makes scaling out relatively easy: the client has already deployed three virtual machines, each serving the same endpoints.

We needed a persistent logging mechanism that was versatile and easy to search and filter. We identified four areas, each with its own set of data to store:

  • Communication with external entities and services, named "transactions" or "TRANS". We needed to log all communication (requests and responses, mostly XML) and to store the duration of each call as well.
  • Exceptions that might occur during regular operations, named "errors" or "ERR". We'd store the initial request data and the entire exception (message and stack trace).
  • The actual request and response pairs of the method called, named "informational" or "INFO". These are XML requests and responses.
  • Debugging information, rather arbitrarily scattered throughout the entire codebase, depending on the need of the moment, named "debugging info" or "DEBUG". Usually plain text.

The client needed to be able to search quickly and get meaningful results about what had happened with the web services. Real-time results would usually not be required.

The Issues

Initially we opted to write all logs to the web service's main data storage, as the information would be easily queryable there. We also decided to store all types of logs under the same entity (let's call it "Log"), which would have the following properties:

  • Id - identity, underlying type a Guid.
  • Position - a string identifying the action that was taking place when the logging action occurred, as well as the type of log.
  • Message - a string, usually containing the web service method request as XML. For the "TRANS" types, it contains the request sent to the third party.
  • Extras - a string. Responses, stack traces and other information, depending on the type of logging action, all go here.
  • Duration - integer, for TRANS logging actions only.
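As a rough illustration, the single "Log" entity described above could be sketched as follows. This is only a sketch under assumptions: the original text does not say how Position encodes the log type, so the "TYPE:action" format, the `LogType` enum and the `make_log` helper are all hypothetical, and the unit of Duration is left unspecified because the text does not state it.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class LogType(Enum):
    """The four kinds of log entries named in the text."""
    TRANS = "TRANS"
    ERR = "ERR"
    INFO = "INFO"
    DEBUG = "DEBUG"


@dataclass
class Log:
    """One entity shared by every log type."""
    position: str                    # action in progress plus the log type
    message: str                     # usually the request XML
    extras: str = ""                 # responses, stack traces, anything else
    duration: Optional[int] = None   # TRANS entries only; unit not stated in the text
    id: uuid.UUID = field(default_factory=uuid.uuid4)


def make_log(log_type, action, message, extras="", duration=None):
    """Hypothetical helper: builds a Log and enforces the TRANS-only duration rule."""
    if duration is not None and log_type is not LogType.TRANS:
        raise ValueError("duration is only recorded for TRANS entries")
    # Assumed encoding: the log type is prefixed to the action name.
    return Log(position=f"{log_type.value}:{action}", message=message,
               extras=extras, duration=duration)
```

Note that every type-specific payload ends up in the free-form Message and Extras strings, so any filter more specific than a date range has to search inside those strings.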

However, the web service is high traffic and produces a large volume of logging information, in the high double digits of megabytes per day. This caused several issues for the client:

  • The log became virtually unqueryable beyond very basic filtering on dates. The client would then have to sift manually through the log entries (often amounting to thousands per minute) to find the one they needed.
  • Attempting more complex queries would often leave the storage unresponsive in general, so the web service in its entirety would hang.
  • Deleting old entries caused exactly the same issue: the entire data storage would hang. Being unable to delete posed further problems, as the logs kept growing.

Solution Attempt

Our first reaction was to split the single log entity into four (ERR, TRANS, INFO, DEBUG), at least to facilitate maintenance. By doing so, though, we sacrificed versatility: the client now had to search through four different entities to understand, for example, what had happened during a failed web service call or a call with unexpected results. Queryability did not improve either, as the volume of information pushed into the data storage was so large that complex queries would hang it even with the new, split schema. Eventually, maintenance ran into the same issues as before.
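To make the loss of versatility concrete, here is a minimal sketch of what reconstructing a single call involves once the entries live in four separate stores. It assumes each entry carries a timestamp (the text only mentions filtering on dates) and models each store as an in-memory list. The `Entry` type and the `timeline` function are hypothetical names used for illustration, not part of the original system.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Entry:
    timestamp: datetime   # assumed field; the text only implies date filtering
    log_type: str         # "ERR", "TRANS", "INFO" or "DEBUG"
    position: str
    message: str


def timeline(start, end, *stores):
    """Query every per-type store for [start, end) and merge the results by time."""
    per_store = [
        sorted((e for e in store if start <= e.timestamp < end),
               key=lambda e: e.timestamp)
        for store in stores
    ]
    # Each per-store result is already sorted, so one merge pass is enough.
    return list(heapq.merge(*per_store, key=lambda e: e.timestamp))
```

With a single entity this is one date-range query. With the split schema the client has to run one query per entity and stitch the results together before the sequence of events becomes readable.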

The story continues in Part 2.