Server Upgrade Hell

This week's project: replace the CPUs in both servers. Why are we doing this? The hardware has been misreporting the speeds of both CPUs on a regular basis for some time now. I only began to notice it when one of the machines started to labor a bit while performing normal tasks. A quick check of the sysinfo page for that system showed the CPU had been downclocked to 1GHz. Its normal speed was 2.4GHz. The other server exhibited this from time to time as well but usually fixed itself. The eventual conclusion I came to was that both CPUs had finally met their end. So it came time to order up a pair of replacements, taking the speed up to 2.6GHz and cutting the wattage from 85W to 45W. A pretty nice upgrade that's also not terribly expensive.

Parts involved: AMD 5050e 45W Energy efficient socket AM2 CPUs, running @ 2.6GHz.

Time expected for the swap: 30 minutes total, both machines.

The CPU swap itself was more or less uneventful. Took the old ones out, cleaned the fans, put the new ones in. Both boxes fired up and so I figured I was done. But alas, it all turned out to be not as easy as one might expect. Both boxes were now reporting the CPUs incorrectly. The BIOS did not recognize them. Given that the motherboards in these boxes were older than the old CPUs, this shouldn't have been much of a shock. Unfortunately this wasn't solvable with a simple BIOS update. Why, you might ask? Our buddies over at Abit have long since gone out of business. So no BIOS update. And the CPU performance had become unstable to the point of needing this fixed one way or another. So, enter part 2 of the project.

Parts involved: ASUS M2N68-AM SE2 motherboards. Built-in VGA, network, SATA, the whole bit.
Time expected for the swap: 1 hour total, for both machines.

This is where things turned ugly, and fast.

First box I swapped out, the system refused to power on. The CPU fan spun up for 2 seconds, then everything died. After checking all the cables and making sure nothing was shorting against the board, I tried again with no luck. So I disconnected all of the power-using devices from the board, leaving only the CPU. Upon trying to get it to power up this way, nothing happened. Fearing I might have picked up a bad board, I reluctantly started the teardown. I'm not even sure what I found counts as a rookie mistake since the CPU fan should not have mounted properly this way, but as it turned out, the CPU locking arm on the socket was sitting straight up. DUH. So that got reassembled, hooked up, and when it powered on and I could get to the BIOS, I figured all was well. Box 1 upgraded.

Second box went smoothly, especially since I made sure the damn socket arm was down this time. Always amazes me that hardware swapping goes faster on the second box even though the process should take about the same amount of time. In any case, once the parts were swapped in I hooked it all back up and turned it on. BIOS came up. Both boxes verified the CPU as recognized and properly configured. Go me.

Except, not go me. Murphy was out in force tonight. After getting switched back to the Windows box to connect and make sure everything was happy, the ton of bricks came down. No network. Couldn't get through by SSH. Tried to connect to the internet via the Windows box, no dice, as DNS was also not responding. So this meant something bad happened.

A quick investigation revealed that the eth0 interface (your normal network config) was missing. The system failed to start it. So I checked the usual place. /etc/sysconfig/network-scripts looked intact. Except trying to start it kept insisting the device did not exist. Some poking around in boot log messages proved otherwise as the kernel was seeing it just fine. For those who might already suspect where this is going, yes, they're both nVidia MCP61 network ports using the forcedeth driver. Which by the way worked fine with Fedora 11 on the old boards, so I had no reason to think the module itself was borked.

So about this time Tarl noticed DNS was down here, and asked what's up. After I relayed what was going on, we went on a major digging fest through the network scripts and modprobe files, and tried various commands which all failed to find the device. During all this we both noticed the MAC address in the ifcfg-eth0 file was wrong. It didn't match the one the kernel had identified in the boot logs. So we both figured on trying to correct that. It still failed, claiming once more that the device did not exist. This was leading down a rather nasty path and some vague memories I had of nVidia and the kernel people getting into pissing matches in the past over closed source drivers. I began to wonder if I'd fallen victim to some stupid ego trip game. On that subject, I feel the need to say this: You GPL zealots can take your license and shove it up your asses. Stop torturing end-users with these elitist games and maybe, just maybe, Linux might become more friendly to use. Your insistence on code communism is your major weakness.

Anyway, we were clearly both doing some head scratching, especially after I noticed the /lib/modules directory containing the forcedeth.ko file was intact and in the right location for the Fedora 11 distro. It was becoming appealing to simply shut the things down and go pick up some NICs in the morning from Fry's. Not an ideal solution, but this is what happens when one does major upgrades in the middle of the night when nothing is open. There didn't seem to be much else left to try.

Then it came. Tarl had one last trick up his sleeve, something to do with udev and hardware rules. A grep command finally turned up something useful other than "device unknown" type results. The file in question: /etc/udev/rules.d/70-persistent-net.rules. This innocuous-looking little thing buried inside the udev rules is more or less akin to the days of yore when people still used ini files and hardware profiles for Windows 3.11. Yeah, this is that archaic. Anyway, inside the file was the MAC address of the old network interface, which is what grep found. The surprise came when I noticed the only other entry there was for the NEW network interface, which the system had simply played stupid and called eth1. One might note that it didn't bother to generate a working profile for that, or to tell me there was such a thing. It just tossed it in, with the right MAC address and all. So the solution became stupidly simple. Delete the old entry and mark the new one as eth0. Reboot.
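
For anyone who would rather script that edit than do it by hand, here is a rough Python sketch of the same idea: drop the rule carrying the old MAC and rename the rule carrying the new MAC to eth0. The function name and regexes are my own invention, not part of udev, and it assumes the rules look like the single-line format this file uses. It only transforms text; reading the file, backing it up, and writing it back as root are left to you.

```python
import re

# Hypothetical helper, not part of udev. Assumes one rule per line with
# ATTR{address}=="<mac>" and NAME="<iface>" fields, as in 70-persistent-net.rules.
ADDRESS_RE = re.compile(r'ATTR\{address\}=="([0-9A-Fa-f:]{17})"')
NAME_RE = re.compile(r'NAME="[^"]*"')


def fix_persistent_net_rules(text, old_mac, new_mac, name="eth0"):
    """Drop the rule for old_mac and rename the rule for new_mac to `name`."""
    kept = []
    for line in text.splitlines():
        match = ADDRESS_RE.search(line)
        if match is None or line.lstrip().startswith("#"):
            kept.append(line)  # comments and blank lines pass through untouched
            continue
        mac = match.group(1).lower()
        if mac == old_mac.lower():
            continue  # stale entry for the port that no longer exists
        if mac == new_mac.lower():
            line = NAME_RE.sub('NAME="%s"' % name, line)
        kept.append(line)
    return "\n".join(kept) + "\n"
```

You would call it with the file's contents, the old board's MAC, and the new one (00:24:8c:d1:a2:02 in my case), then write the result back and reboot. Any comment line sitting above the removed rule is left in place, which is harmless.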

Long story short, you're here now reading this. So the above solution solved the problem. Some 6 hours after the motherboard swappage began.

A quick rundown for future reference:

The error: "forcedeth device eth0 does not seem to be present, delaying initialization" - This comes up when booting the system, or when running "ifup eth0" by hand.

When upgrading a motherboard, if your network suddenly fails to work for no reason you can figure out, look here: /etc/udev/rules.d/70-persistent-net.rules

# This file was automatically generated by the /lib/udev/write_net_rules
# program run by the persistent-net-generator.rules rules file.
# You can modify it, as long as you keep each rule on a single line.

# PCI device 0x10de:0x03ef (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:24:8c:d1:a2:02", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

The problem I had was this file originally held a second entry with the old MAC address, and a notation saying it had been placed by an external tool. Bad shit, because there is only one port on the new board.

After editing that file, go to /etc/sysconfig/network-scripts/ifcfg-eth0 and make sure the MAC address in there matches the one in the udev rule for eth0. Then reboot your machine.
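
If you want to double-check that the two files agree before rebooting, something like the following Python sketch would do it. The function names are made up for illustration, and it assumes the ifcfg file stores the MAC under the HWADDR key, which is the usual Fedora convention but worth confirming on your own box. It works on the text of the two files so you can feed it whatever you read from disk.

```python
import re


def udev_mac_for(rules_text, name="eth0"):
    """Return the MAC bound to `name` in udev rules text, or None if absent."""
    for line in rules_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        if 'NAME="%s"' % name in line:
            match = re.search(r'ATTR\{address\}=="([^"]+)"', line)
            if match:
                return match.group(1).lower()
    return None


def ifcfg_mac(ifcfg_text):
    """Return the HWADDR value from KEY=value style text, or None if absent."""
    for line in ifcfg_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "HWADDR":  # assumption: MAC lives under HWADDR
            return value.strip().strip('"').lower()
    return None


def macs_match(rules_text, ifcfg_text, name="eth0"):
    """True only when both files name the same MAC for the interface."""
    udev = udev_mac_for(rules_text, name)
    return udev is not None and udev == ifcfg_mac(ifcfg_text)
```

The comparison is case-insensitive, since one file may hold the address in upper case and the other in lower case.
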
"It is pointless to resist, my son." -- Darth Vader
"Resistance is futile." -- The Borg
"Mother's coming for me in the dragon ships. I don't like these itchy clothes, but I have to wear them or it frightens the fish." -- Thurindil

Well. I guess that's that then.


Posted on Jul 9, 2009 5:37 pm by Samson in: | 5 comment(s) [Closed]
Yeah, well, when they decide to stop changing the way half the systems on the OS work with every iteration maybe I'll have more patience to put up with the quirks of linux. But they show no signs of doing that. Plus there was no real excuse for the system to cling to a network port that didn't exist on the bus anymore.

UH... this comment should have appeared under Conner's. Looks like the system clock was fubared at some point too.

Don't you just love it when it takes six hours to find out that you needed to manually edit a two line file down to a one line file? *L*
Of course, leaving the CPU socket arm up so that it blocked the cpu fan was probably a good indicator that it wasn't going to be a quick easy night.. :D

hmm, one last vestige of a troubled upgrade?

I understand what you're saying but that's the joy of an open source OS, everyone's got their little bit to throw in. *shrug* ..agreed, it should have updated automatically but ini style files don't always, never was a perfect system to use, but it does work most of the time.

Looks like. I guess the clocks were off too on top of everything else. But that seems to be sorted now.

Open source isn't the problem. It's the constant fascination they all seem to have with changing things on every version update. You can't get any kind of consistency from behavior like that. Things that worked before stop working. Or in the case of this udev stuff, introduce entirely new behavior that hasn't been adequately tested.

Then of course you have the application developers. You know, the ones like the ClamAV guys who between version 0.9.4 and 0.9.5 decided they just had to redo the entire architecture of the virus scanning engine. They made so many changes the system no longer functions and was impeding the delivery of email. None of the old setup guides work and in typical linux style they don't have any sort of documentation on the site for the app at all other than a bunch of FAQs that are worded to imply you don't know WTF you're doing if you need to read them.

Well, at least the clock being off is a very minor snafu. :)

That's probably why I don't always stay entirely up-to-date and even then do my updates on my workstation before I do them on my servers.. ;)

Oh yeah, I love documentation like that. It's almost as good as the old modems that only listed a BBS phone number (invariably long distance at that) for tech support. :D
