The Heart of the System: July 2011

I had the occasion today to build a brand new network from scratch. It is a very simple network: its only purpose is to act as a buffer between my company's LAN and a government site accessed through a virtual private network (VPN).

I took the opportunity to write down the architecture of the network as configuration items (CIs), from the top down as recommended by the ITIL chapter on configuration management. I identified the route as the top component (Route to <Government Network>). From there I listed the other LANs (or rather, the virtual LANs: VLANs). Then I broke everything down into routers, switchport configurations, configuration files, static routes, the one computer with its two (2) NICs, the network address translation (NAT) scripts, everything.
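A minimal sketch of that top-down breakdown, in Python. The class, the CI type names and the item names are all placeholders made up for illustration; they are not the real inventory and not anything OTRS defines.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ConfigItem:
    """One configuration item (CI) and the CIs it is broken down into."""
    name: str
    ci_type: str
    children: list[ConfigItem] = field(default_factory=list)

    def add(self, child: ConfigItem) -> ConfigItem:
        """Attach a child CI and return it, so the tree can be built top down."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, item) pairs, starting from this CI and going down."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Hypothetical inventory: the route is the top component, everything else hangs off it.
route = ConfigItem("Route to <Government Network>", "Route")
vlan = route.add(ConfigItem("Buffer VLAN", "VLAN"))
router = vlan.add(ConfigItem("Edge router", "Router"))
router.add(ConfigItem("Static route", "Static Route"))
router.add(ConfigItem("Router configuration file", "Configuration File"))
gateway = vlan.add(ConfigItem("NAT gateway", "Computer"))
gateway.add(ConfigItem("NIC 1", "Network Interface"))
gateway.add(ConfigItem("NIC 2", "Network Interface"))
gateway.add(ConfigItem("NAT script", "Script"))

# Print the hierarchy as an indented outline.
for depth, item in route.walk():
    print("  " * depth + f"{item.ci_type}: {item.name}")
```

Writing the list as a tree like this makes it obvious which CI types a CMS will need before any data entry starts.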

Once I plonked down the CIs, I created the diagram. Once I created the diagram, I started experimenting with the ITSM (IT Service Management) CMS (Configuration Management System) module of OTRS (Open-source Ticket Request System). The default CI types (computer, hardware, software, locations) weren't sufficient, so I had to create a bunch of new CI types. In OTRS, these are called ITSM::ConfigItem::Class items. Not easy to find the links to set these up! Then I needed to create a bunch of other ITSM::ConfigItem classes for locations and Ethernet interface types so I could create pull-down menus in the new CIs. Lastly, I needed to program (!) some Perl structures to handle the new CI types. OTRS is powerful, but the documentation for its CMS is pretty sparse. Fortunately, I have a grip on the code (or, more to the point, on copy-paste-edit from the defaults). Tomorrow, I'll put in the fields I've missed.

Eventually, I can tie in another system to audit the CIs in the OTRS CMS against their real-life objects! Looking forward, looking forward.
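What such an audit might boil down to, as a rough Python sketch: compare what the CMS records with what a discovery tool reports. The data layout, the CI names and the attribute names are assumptions for illustration only; how the records would actually be pulled out of OTRS or off the network is not covered here.

```python
def audit(recorded: dict[str, dict[str, str]],
          discovered: dict[str, dict[str, str]]) -> list[str]:
    """Return human-readable discrepancies between CMS records and reality."""
    findings = []
    for name, attrs in sorted(recorded.items()):
        actual = discovered.get(name)
        if actual is None:
            findings.append(f"{name}: recorded in the CMS but not found")
            continue
        # Only attributes the CMS knows about are compared.
        for key, expected in sorted(attrs.items()):
            if actual.get(key) != expected:
                findings.append(
                    f"{name}: {key} is {actual.get(key)!r}, CMS says {expected!r}"
                )
    for name in sorted(set(discovered) - set(recorded)):
        findings.append(f"{name}: found on the network but missing from the CMS")
    return findings


# Hypothetical records, not real data.
recorded = {
    "nat-gateway": {"nic_count": "2", "vlan": "20"},
    "edge-router": {"vlan": "20"},
}
discovered = {
    "nat-gateway": {"nic_count": "3", "vlan": "20"},
    "spare-switch": {"vlan": "20"},
}

for finding in audit(recorded, discovered):
    print(finding)
```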

One of these days I'll be able to follow my own advice. Yesterday I made a change to the email routing of my company. The change was necessary, advisable and, for the most part, effective. It eliminated two (2) points of failure and cut off one datacentre's dependency on another, less-reliable, site.

But certain customers were cut off from their email. Not good, not good.

Going back to a post on ITIL's incident management, I'll restate what happened. At least the troubleshooting was fast and effective.

Incident: a customer complained that they were no longer receiving emails from a server I shall call "B.nospam.com."
Incident logging: taken care of by the incident management system (a custom job).
Incident categorization: the last thing that changed, the new routing, is the probable cause. Noted.
Incident or Service Request? Definitely an incident.
Incident Prioritization: This was, at first, a high-priority, medium-impact incident: level 2. Later I reprioritized it to high-impact, since more than one customer was affected. Level 1: major incident. The boss was informed.
Initial Diagnosis: end-to-end check. I sent local emails from the server and followed the logs on it and on the routing email server, searching both for the customer's domain name. The new routing server's mail log was full of 'host customer.domain.com said: 450 4.1.8 <b.nospam.com>: Sender address rejected: Domain not found (in reply to RCPT TO command)'. In other words, the customer's mail server could not resolve the sender's domain: emails coming from anywhere in the company should be from 'mailbox@nospam.com' and not from 'mailbox@server.nospam.com'.
Escalation: searching the log for the '450' error code showed that the problem was widespread. Level 2 became level 1. However, since the answer was obvious, I sent the escalation message to my boss after the fix.
Investigate and Diagnose: this part didn't need to be performed, since the initial diagnosis gave the faulty component: the "postfix" configuration on the new email router.
Resolve and Recover:

A quick search for "masquerade postfix" sent me to the Postfix website and the configuration line that was missing.
Experience told me that the change was quick and low-risk (and since I'm the change authority for the router, I gave myself permission to make the change).
I applied the line change and documented it in the build log for the server.
I tested by sending emails and checking for a correct masquerade.
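The rewrite that the test looks for can be illustrated in a few lines of Python. This assumes the missing line was Postfix's masquerade_domains setting (the post does not name it), which strips the subdomain part from addresses under the listed domains. The function below is only a model of that rule for checking expectations, not Postfix's own code.

```python
def masquerade(address: str, masquerade_domains: list[str]) -> str:
    """Strip subdomains of any listed parent domain from an email address."""
    if "@" not in address:
        return address  # bare local names are left alone in this sketch
    local, _, domain = address.rpartition("@")
    for parent in masquerade_domains:
        # Only true subdomains are rewritten; the parent itself is untouched.
        if domain.lower().endswith("." + parent.lower()):
            return f"{local}@{parent}"
    return address


# Expectations taken from the incident: server hostnames must not leak out.
assert masquerade("mailbox@b.nospam.com", ["nospam.com"]) == "mailbox@nospam.com"
assert masquerade("mailbox@nospam.com", ["nospam.com"]) == "mailbox@nospam.com"
assert masquerade("someone@example.org", ["nospam.com"]) == "someone@example.org"
```

Keeping the expectations as assertions means the same three cases can be rerun against real test emails the next time the routing changes.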

Close the incident

The incident is not yet closed. I sent the confirmation message and asked for feedback but it hasn't come back yet.
I have created a "problem" ticket. Although the immediate cause of the problem was the faulty configuration, the root cause was my insufficient testing of the system when it was put in place.
I have also put in a change request to myself to add better monitoring of the email system. A great number of 450 errors were generated and never caught by automatic incident management.
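A minimal sketch of what that monitoring could look like, in Python: count temporary (4xx) rejections per remote host in the mail log and flag any host that crosses a threshold. The regular expression is built only from the error text quoted above; the rest of the log line format, the function names and the threshold are assumptions.

```python
import re
from collections import Counter

# Matches fragments such as "host customer.domain.com said: 450 4.1.8".
REJECT_RE = re.compile(r"host (?P<host>\S+) said: (?P<code>4\d\d) ")


def count_temporary_rejections(lines) -> Counter:
    """Count 4xx replies per remote host in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match["host"]] += 1
    return counts


def hosts_over_threshold(counts: Counter, threshold: int) -> list[str]:
    """Return the hosts with at least `threshold` temporary rejections."""
    return sorted(host for host, n in counts.items() if n >= threshold)


# Sample lines based on the message from the incident; not a real log.
sample = [
    "host customer.domain.com said: 450 4.1.8 <b.nospam.com>: "
    "Sender address rejected: Domain not found (in reply to RCPT TO command)",
    "host customer.domain.com said: 450 4.1.8 <b.nospam.com>: "
    "Sender address rejected: Domain not found (in reply to RCPT TO command)",
    "host other.example.org said: 250 2.0.0 Ok: queued",
]

counts = count_temporary_rejections(sample)
for host in hosts_over_threshold(counts, threshold=2):
    print(f"ALERT: {counts[host]} temporary rejections from {host}")
```

Run from cron against the day's log, a check like this would have turned the flood of 450 errors into an incident ticket long before a customer had to complain.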

ITIL loves to bandy the word "holistic" about. It means comprehending all parts of a system. In this case, I had tested much of the functionality, but not all of it. I must find some text on quality assurance for proper testing methodology. Perhaps ITIL has it in its Service Transition guide...

Wednesday, July 27, 2011

Configuration Items Revisited

Tuesday, July 26, 2011

On The Need to Test And Monitor