The Heart of the System: On The Need to Test And Monitor

One of these days I'll be able to follow my own advice. Yesterday I changed my company's email routing. The change was necessary, advisable and for the most part effective. It eliminated two points of failure and removed one datacentre's dependency on another, less reliable site.

But certain customers were cut off from their email. Not good, not good.

Going back to a post on ITIL's incident management, I'll restate what happened. At least the troubleshooting was fast and effective.

Incident: a customer complained that they were no longer receiving emails from a server I shall call "B.nospam.com."
Incident logging: taken care of by the incident management system (a custom job).
Incident categorization: the last thing that changed, the new routing, is the probable cause. Noted.
Incident or Service Request? Definitely an incident.
Incident prioritization: at first this was a high-priority, medium-impact incident, level 2. Later I reprioritized it to high-impact because more than one customer was affected. Level 1: major incident. The boss was informed.
Initial diagnosis: an end-to-end check. I sent local emails from the server and followed the logs on that server and on the routing email server, searching both for the customer's domain name. The new routing server's mail log was full of 'host customer.domain.com said: 450 4.1.8 <b.nospam.com>: Sender address rejected: Domain not found (in reply to RCPT TO command)'. Basically, emails coming from anywhere in the company should be from 'mailbox@nospam.com' and not from 'mailbox@server.nospam.com'.
Escalation: searching the logs for the '450' error code showed the problem was widespread. Priority 2 became Priority 1. However, since the fix was obvious, I sent the escalation message to my boss after applying it.
Investigate and Diagnose: this step wasn't needed, since the initial diagnosis had already identified the faulty component: the Postfix configuration on the new email router.
Resolve and Recover:

A quick search for "masquerade postfix" led me to the Postfix website and the configuration line that was missing.
Experience told me that the change was quick and low-risk, and since I'm the change authority for the router, I gave myself permission to make it.
I applied the line change and documented it in the build log for the server.
I tested by sending emails and checking for a correct masquerade.
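
The masquerade check in that last step can be scripted rather than eyeballed. What follows is a minimal sketch in Python, not the test I actually ran: the function names are hypothetical, and it only inspects sender addresses you have already collected (for example, from the headers of the test emails). The post doesn't record which configuration line was added. Postfix's `masquerade_domains` parameter is the usual way to rewrite 'mailbox@server.nospam.com' to 'mailbox@nospam.com', but that is an assumption here.

```python
from email.utils import parseaddr


def is_masqueraded(address, company_domain):
    """Return True if the address uses the bare company domain.

    'mailbox@nospam.com' passes; 'mailbox@b.nospam.com' does not.
    """
    # parseaddr strips any display name, e.g. "Ops <ops@nospam.com>"
    _, addr = parseaddr(address)
    if "@" not in addr:
        return False
    domain = addr.rsplit("@", 1)[1].lower()
    return domain == company_domain.lower()


def unmasqueraded(addresses, company_domain):
    """Return the sender addresses that still leak a host name."""
    return [a for a in addresses if not is_masqueraded(a, company_domain)]
```

Running `unmasqueraded` over the senders seen on the routing server after the fix should give an empty list. Anything left over is a host that is still sending under its own name.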

Close the incident:

The incident is not yet closed. I sent the confirmation message and asked for feedback but it hasn't come back yet.
I have created a "problem" ticket. Although the immediate cause of the problem was the faulty configuration, the root cause was my insufficient testing of the system when it was put in place.
I have also put in a change request to myself to add better monitoring of the email system. A great number of 450 errors were generated and never caught by automatic incident management.
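
As a starting point for that monitoring change, here is a rough sketch in Python of the kind of check that would have caught this: count the temporary (4xx) rejections in the mail log and raise a flag past a threshold. The regular expression is modelled on the single log line quoted above, so other reply formats may need a different pattern. The function names and the threshold of 10 are assumptions for illustration, not values taken from my system.

```python
import re
from collections import Counter

# Matches a remote server's temporary-failure reply, e.g. "said: 450 4.1.8 ..."
SOFT_REJECT = re.compile(r"said: (4\d\d) (\d\.\d+\.\d+)")


def count_soft_rejects(lines):
    """Count 4xx replies in mail log lines, keyed by SMTP reply code."""
    counts = Counter()
    for line in lines:
        match = SOFT_REJECT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def should_alert(counts, code="450", threshold=10):
    """Return True once the given reply code reaches the threshold."""
    return counts[code] >= threshold
```

In practice this would read the routing server's mail log on a schedule and open an incident when `should_alert` returns True. Where the log lives and how often to check depend on the setup.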

ITIL loves to bandy the word "holistic" about. It means comprehending all parts of a system. In this case, I had tested much of the functionality, but not all of it. I must find some text on quality assurance for proper testing methodology. Perhaps ITIL has it in its Service Transition guide...


Tuesday, July 26, 2011
