Computing the Response Time per Transaction in IBM Maximo
Delays, delays, delays..
In this age of obsession with fast processing speed and the monetary loss for every delayed transaction, companies are throwing away money in upgrades each time they fail to spot the real bottleneck. I've had some experience with companies who went as far as adding another server or maxing out the server's maximum heap size because the only clues they are left with are the frequent server crashes and OutOfMemoryErrors.
Pointing Out the Wrong Bottleneck
It is easy to assume that a simple memory upgrade can resolve the problem when all you are presented with are errors related to memory. Most people who encounter descriptive errors assume that the only step that remains is to address the issue. The log says you're running out of memory? Add more! Simple, right? Wrong. Most of the time, these fast and simple solutions leave us frustrated when after a few days or so, we realize that it does not work and the issue remains. What's worse is that the computer still tells you the same error over and over again.
Maybe this is all a misunderstanding. Computers don't lie, they tell you every information they got but they are not native English speakers or even "human" speakers. All you have to do is know how to translate and use that information and make a proper analysis to decode what the computer means when it says this or that. When faced with another seemingly descriptive error such as an OutOfMemoryError, think deeper. Is it really an out of memory error? What caused this error? Is it safe to assume that the server simply couldn't handle the amount of transactions? Or are there other underlying issues that prevent the server from performing as well as they are supposed to?
This article will focus specifically on IBM Maximo but the same analysis can be done on other applications as well. Let's begin with our sources of data before we proceed to analysis.
The Message Tracking application is useful in checking when a message was received, processed, or encountered an error. The message ID, queue name, status, and interface name are all very important details that identify a specific transaction. If you want to know how long it took before a transaction was processed, you can simply subtract the received date from the processed date found in the message tracking history table. The result produced by this method is just a close estimate since milliseconds are usually not displayed or captured by the database.
The transaction's status. For this article, we'll only mention the RECEIVED and PROCESSED status.
The date and time when the transaction was received or processed.
The transaction's unique identifier.
However, this method works only in the perfect world where every transaction that lands in the queue are picked-up and processed immediately; otherwise, the amount of time when the message waited in the queue is added to your computation. The response time you've just computed becomes even more inaccurate as the application experiences down times or delays caused by the external systems you are interfacing with or any other cause.
System Log Files
The System log files are very useful as well. It provides information not usually found in the front-end or recorded in the database tables. Almost all activities within the application including what happens behind the scenes can be found here. The time when a cron task ran, picked-up the messages from the queue, processed each message one at a time (sequential), and finished processing all the messages it picked-up are all written in the logs. The best thing about the logs is the time stamp that includes milliseconds. You can see in great detail when each transaction was sent to the endpoint and when the endpoint sent back an HTTP Response OK (processed) or some error (timeout or other errors).
A Challenge In Identifying the Root Cause
When a Maximo application is integrated with external systems and uses many interfaces that share the same queue, it can be difficult to identify the root cause of delayed transactions processing. In some setups, there's only one inbound queue and one outbound queue shared by multiple interfaces.
On a very specific scenario, when a problem occurs in one of the outbound interfaces that slows down the processing of that interface's transactions, given the design of Maximo to process messages sequentially, the problematic interface's issue could affect the rest of the interfaces' transactions that are processed through the same outbound queue. This could cause hundreds or even thousands of messages piled up in the queue for hours. When this happens, the Maximo application becomes a primary suspect especially if the external systems can pick-up multiple transactions at the same time while Maximo can only pick-up another transaction for processing after the last one has finished.
You may result into packet-sniffing or tcpdump to see when the message was sent to the external system, when it was acknowledged by the external system, and when the external system sent an HTTP Response OK to complete the processing but this is usually not allowed by companies because it's a security breach.
So how can you prove whether the Maximo application itself is causing the bottleneck or not in a legal manner? How can you gather and present evidences using only the data and tools that you are allowed access to?
In order to make a convincing analysis, you must first embrace the existing biases that people already hold and keep that as a priority in your "disproving" list. In this case, that bias is the fact that the pick-up of every transaction in the queue is sequential. This makes people assume that Maximo is causing the delay. Be sure to keep your own biases in check. If you're the owner of the Maximo application, you will most likely blame the external systems that you are interfacing with. Now that you are aware of you and other people's biases, you can take a look at the gathered data fairly.
This analysis requires only the tools listed below. Both can be downloaded from the Internet for free. Aside from these tools, you also need the credentials to login to the server and to connect to the database.
- A terminal to access the server and system logs (e.g. MobaXterm, PuTTY, etc.)
- Any database administration query tool to run SQL scripts (e.g. Oracle SQL Developer, etc.)
Final and Filtered Results
The image above shows the final result of a query used to gather data on how delays have accumulated. The query joins information from the message tracking table and an imported log file. The transaction ID (TRANS), message ID (MEAMSGID), and interface name (IFACENAME) in the screenshot are covered to protect the company's data.
The result will be explained in detail later. First, let's briefly describe each column seen in the final result using the table below:
The row number of the transaction in the result of the query
The unique identifier inside the message body (e.g. work order number, asset number, etc.)
The unique identifier of the message/transaction itself regardless of the data inside its body
The name of the interface (used to segregate different services and types of transactions)
Classifies a transaction whether it is inbound (Sync) or outbound (Publish)
Classifies a transaction as "RECEIVED" or "PROCESSED"
The time and date when a transaction landed on the queue
The time and date when a transaction has been processed
The rounded up (to minutes) difference between the received and processed date/time
The difference between the received and processed date/time in seconds
The actual accumulated delay from the previous transaction to the current row's transaction
Detailed and Unfiltered Results
The image above shows the detailed result of the query. Notice the following in the result:
- There are no skipped row numbers since this is an unfiltered result.
- The rows are in ascending order of the PROCESSED_DATE's values.
- The boxed cells (in purple) in the PROCESSED_DATE column shows how the accumulated delay per transaction is computed (ACC_SEC).
- There is a negative value in the ACC_SEC column
While the filtered result shows the bottleneck transactions that contributed most to the delay, the unfiltered result shows every transaction in ascending order to show the build-up and decline of the accumulated delay per transaction. Let's use the boxed cells (in purple) as an example to see how the accumulated delay per transaction is computed.
Early Clues / Timeouts
A negative value in the ACC_SEC column is a good indicator that the processing time went faster and even took out a few seconds from the accumulated delay. A positive value should be a concern because this means that the delay is building up and unprocessed transactions could be piling up in the queues. If the positive value has reached 15 seconds or more, you will most likely find timeout errors in the logs. It depends on the configured timeout value, by default it is set to 15 seconds. A timeout occurs when 15 seconds (or any timeout setup) has passed and the server has not yet received a response from the external system. This can be an early indicator that the problem is with the external system and not the Maximo application itself.
Transaction RN#1328 was processed on 05/13/2016 15:38:40
Transaction RN#1329 was processed on 05/13/2016 15:38:54
It took 14 seconds before RN#1329 (current transaction) was processed from the time RN#1328 (previous transaction) was processed. The comparison of processed dates more accurately measures the accumulated delay and the increase/decrease in processing time of each transaction compared to the usual processed date minus received date which includes each transaction's "idle" time in the queues.