
Sunday, March 10, 2013

Logging OSB Webservice call response times - A simple approach

1. Overview
There are multiple ways of monitoring service performance in OSB, such as:
  1. Using Oracle Enterprise Manager (OEM)
  2. Using Business Transaction Management (BTM)
  3. Enabling monitoring for every Proxy Service in OSB and writing scripts to collect and aggregate the data
  4. Explicitly adding logging in the pipeline, either using the Reporting Provider or another mechanism
  5. Using an Audit WS-Policy
  6. Extending the WebLogic access.log to log the time taken and, optionally, pipeline metadata
  7. Other
If you have implemented items 1, 2 or 5, you can probably skip this post, as you most likely already have good monitoring coverage.

If you have implemented items 3, 4 or none of them, you should definitely read this post.

If you have implemented item 6, you have probably already been through a post like this one. :)

  
2. Objective

The main objective of this post is to present a simple, effective and quick way to implement a first level of monitoring, or to extend the monitoring you already have.

Remember: there is no one-size-fits-all solution for good monitoring coverage. You need to monitor different components and correlate the events.

If you have zero or limited monitoring coverage, this approach will bring you to a level that either helps you troubleshoot issues or gives you additional performance indicators.


3. Approach

The approach consists of extending the WebLogic access.log to log the time taken to return the response to the consumer, plus additional metadata.

By default, WebLogic already logs the most relevant information:
  • The timestamp
  • The URL (which is the service endpoint)
  • The HTTP response code
  • The response payload size in bytes
As you might have noticed, there is a limitation to this approach: it covers only the HTTP transport; it does not cover JMS, MQ, Tuxedo, etc.

However, it gives you some level of monitoring, which is better than nothing. On the positive side, you don't have to change your code to log the time taken, and the configuration change takes less than five minutes.


4. Using this approach in combination with other approaches

4.1 If you are using the OSB monitoring for the Proxy Services

In case you are using the OSB monitoring for your Proxy Services, you should consider adding the time taken to the access.log, for the following reasons:
  • The OSB monitoring mechanism is a great feature, but it requires you to collect the data regularly and store it somewhere in order to trace performance trends. This is due to how the data is aggregated: it aggregates from the last statistics reset, and over the last X minutes, where X is configurable.
  • If you use the information from the access.log, you can aggregate it over whatever interval you want, as you have the raw data stored per service invocation.
  • The OSB monitoring does not aggregate the response time in case of an error. You might be interested in this information.
  • The OSB monitoring response time aggregation does not include the time spent before the request reaches the pipeline. In other words, if your authentication provider takes 2 seconds to authenticate a user and the service execution takes 100 milliseconds, OSB monitoring will record only 100 milliseconds, while the time taken in the access.log for this particular request will be 2.1 seconds.

4.2 If you already have implemented the logging explicitly in the pipeline

  • The explicit logging you have implemented cannot measure the time spent before the request reaches the pipeline, which means the response time may be inaccurate in some circumstances. The time taken in the access.log, on the other hand, includes that time.
  • You can correlate the data from both log files (the access.log and your own log entries) by following section 6 below, which logs additional metadata to the access.log.
  • If you define a Work Manager and some requests start being rejected, the exception never reaches the OSB pipeline, which means your pipeline logging mechanism cannot record it. An entry is still generated in the access.log, however, with HTTP error code 503.


5. Adding time taken to the access.log

5.1 Configuration


  • Access the WebLogic console.
  • Select Environment - Servers - and then one of the managed servers
  • Select Logging - HTTP
  • Select Advanced
  • Set Format=Extended (as per the following screenshot, selection 1)
  • Enter the following value in the Extended Logging Format Fields (as per the following screenshot, selection 2)

c-ip date time time-taken cs-method cs-uri sc-status bytes

  • Click on Save


Note: for test purposes only, you can change the Log File Buffer (selection 3 above) to zero. This makes it easier to see the output in the access.log file; otherwise, WebLogic only writes to the file when the buffer reaches 8 KB (the default value).

  • Repeat the above procedure for each of the managed servers
  • Activate the changes
  • Restart each of the managed servers
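
The console steps above can also be scripted, which helps when you have many managed servers. The following is a minimal WLST sketch; the admin URL, credentials and server name are placeholders, and the MBean path and attribute names (from WebServerLogMBean) should be verified against your WebLogic version. Run it with wlst.sh, not plain Python:

```python
# WLST sketch -- placeholder credentials, URL and server name; adjust to your domain.
connect('weblogic', 'welcome1', 't3://localhost:7001')
edit()
startEdit()

server = 'osb_server1'
# WebServerLogMBean holds the access.log settings for each server
cd('/Servers/%s/WebServer/%s/WebServerLog/%s' % (server, server, server))
set('LogFileFormat', 'extended')
set('ELFFields', 'c-ip date time time-taken cs-method cs-uri sc-status bytes')

save()
activate()
# A restart of the managed server is still required for the change to take effect.
```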


5.2 Log File output

If you look at the managed server access.log, you will see output like the following. The fourth column is the time taken, in seconds.


#Version:       1.0
#Fields:        c-ip date time time-taken cs-method cs-uri sc-status bytes
#Software:      WebLogic
#Start-Date:    2013-03-07      00:25:02
127.0.0.1       2013-03-07      00:28:53        1.468   POST    /CustomerV1/Customer  200     1207
127.0.0.1       2013-03-07      00:28:57        0.063   POST    /CustomerV1/Customer  200     1207
127.0.0.1       2013-03-07      00:28:58        0.052   POST    /CustomerV1/Customer  200     1207
127.0.0.1       2013-03-07      00:28:58        0.045   POST    /CustomerV1/Customer  200     1207
127.0.0.1       2013-03-07      00:28:59        0.012   POST    /CustomerV1/Customer  200     1207

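Once entries like these are being written, the raw data is easy to aggregate per invocation. The following is a minimal sketch in Python (my own illustration, not part of the OSB setup); the field positions match the extended format configured above:

```python
# Aggregate time-taken per URI from a WebLogic extended-format access.log.
# Field order: c-ip date time time-taken cs-method cs-uri sc-status bytes
from collections import defaultdict

def aggregate(lines):
    stats = defaultdict(list)  # uri -> time-taken values in seconds
    for line in lines:
        if not line.strip() or line.startswith('#'):  # skip ELF header lines
            continue
        fields = line.split()
        stats[fields[5]].append(float(fields[3]))
    # per URI: (invocation count, average, worst)
    return {uri: (len(v), sum(v) / len(v), max(v)) for uri, v in stats.items()}

sample = """#Version: 1.0
#Fields: c-ip date time time-taken cs-method cs-uri sc-status bytes
127.0.0.1 2013-03-07 00:28:53 1.468 POST /CustomerV1/Customer 200 1207
127.0.0.1 2013-03-07 00:28:57 0.063 POST /CustomerV1/Customer 200 1207
""".splitlines()

count, avg, worst = aggregate(sample)['/CustomerV1/Customer']
print(count, avg, worst)
```

In a real setup you would read (or tail) the log file instead of a hard-coded sample, and probably compute percentiles per time interval as well.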

6. Adding transactionID to access.log

In case you have implemented your own logging inside your pipeline and you also want to benefit from the access.log metrics, you might be wondering how to correlate the events. You could try to correlate based on the URI and the timestamp; however, this is not 100% reliable, especially if you have a high throughput.

The aim here is to add a correlation in the access.log. You can achieve this by:

  • Adding a response HTTP header in your pipeline, containing the transactionID
  • Changing the WebLogic Extended Logging Format Fields, adding the transactionID header
These two steps are documented below:

6.1 Adding a response HTTP Header in the pipeline


You will have to add a Transport Headers action to your Proxy Service, as follows. It's recommended to add it in your global error handler as well, so the header is also set when an exception occurs.




The transport header is called "transactionID".
This header is populated with the value of $messageID, which is automatically generated by OSB and is the same value I use in my pipeline auditing, making it possible to correlate both events.





6.2 Adding the transactionID to be logged in the access.log

In order to log the transactionID in the access.log, repeat step 5.1, but this time set the following value for the "Extended Logging Format Fields":

c-ip date time time-taken cs-method cs-uri sc-status bytes sc(transactionID)

6.3 Log File output


If you look at the managed server access.log, you will see output like the following. The last column is the transactionID.
To make it easier to visualise each record, I've added a blank line between them.



#Version:       1.0
#Fields:        c-ip date time time-taken cs-method cs-uri sc-status bytes sc(transactionID)
#Software:      WebLogic
#Start-Date:    2013-03-10      16:46:25
127.0.0.1       2013-03-10      16:54:23        1.393   POST    CustomerV1/Customer 200     1207    "5129230022325267856-60c5b97c.13d52d60fb3.-7fde"

127.0.0.1       2013-03-10      16:54:39        0.034   POST    CustomerV1/Customer 200     1207    "5129230022325267856-60c5b97c.13d52d60fb3.-7fdc"

127.0.0.1       2013-03-10      16:54:40        0.017   POST    CustomerV1/Customer 200     1207    "5129230022325267856-60c5b97c.13d52d60fb3.-7fda"

127.0.0.1       2013-03-10      16:54:41        0.05    POST    CustomerV1/Customer 200     1207    "5129230022325267856-60c5b97c.13d52d60fb3.-7fd8"

127.0.0.1       2013-03-10      16:54:42        0.013   POST    CustomerV1/Customer 200     1207    "5129230022325267856-60c5b97c.13d52d60fb3.-7fd6"




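With the transactionID present in both files, correlation becomes a simple join on that value. A minimal sketch in Python follows; the pipeline log format here is hypothetical, so adapt the parsing to whatever your own pipeline logging produces:

```python
# Join access.log records with pipeline log entries on transactionID.
def parse_access(lines):
    """Index extended-format access.log records by transactionID."""
    records = {}
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue
        f = line.split()
        txid = f[8].strip('"')  # sc(transactionID) is written in quotes
        records[txid] = {'time_taken': float(f[3]), 'status': f[6]}
    return records

def correlate(access_lines, pipeline_entries):
    """pipeline_entries: iterable of (transactionID, message) pairs."""
    access = parse_access(access_lines)
    return [(txid, msg, access.get(txid)) for txid, msg in pipeline_entries]

# Hypothetical sample data, trimmed for readability
access_log = ['127.0.0.1 2013-03-10 16:54:23 1.393 POST CustomerV1/Customer 200 1207 "...-7fde"']
pipeline = [('...-7fde', 'Customer lookup OK')]

for txid, msg, metrics in correlate(access_log, pipeline):
    print(txid, msg, metrics['time_taken'])
```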
7. Aggregating access.log data

Once the access.log files are generated, you need to aggregate the information.
That is out of the scope of this post. I might write another post exploring different options for:

  • Loading the data in realtime to a repository
  • Making the data searchable
  • Aggregating the data


8. Conclusion

Just to emphasise what I mentioned at the beginning of this post: this approach is a quick win. You can implement it with low effort, in under five minutes, and it will give you additional metrics for your environment.

I often hear organisations saying: "We are selecting the perfect tool, which will monitor everything".

The bottom line: tool selection and implementation usually take months or years, depending on the company's bureaucracy.

If you have an application in production you cannot wait. You need a pragmatic monitoring approach RIGHT NOW.

This is just one out of many monitoring approaches. Its coverage is limited, therefore it should be used in combination with other approaches.

I will post other monitoring approaches, which will complement the one explained in this post.

