Late Sept News Part I and SGI Woes

SGI, HPC with cost as no object; but what did reliability cost?

SGI has been a very interesting new road for me. It was the better part of 10 years before I was able to get ahold of Octane2 un-obtainium. After gaining a better understanding of how the Octane / Onyx racks work with their NUMAlink and crossbars, I knew that I wanted to have one as a server. I knew that it would be fast, and different.

Just like in this video, “SGI IRIS 2400”, SGI advertises that cost is no object in bringing performance computing to the world. However, I wonder if reliability was part of the cost? Just like someone I follow on the internet, I too have had a bit of bad luck with this lately:

Now this is only the beginning of the story, and there is much more to be told. I know how my story ends, which is a bit sad. I have been led to believe that any old SGI box is not reliable enough anymore to be called a “server”. It has been interesting to observe that hardware a mere 7 years old is already this flaky. WHY!? I don’t understand it. When I opened up the power supply there were even name-brand components in there, Nichicon and Rubycon caps that seemed to be of decent quality, yet they were still worn out. Yeah, I may not know how many hours were on the servers before I got them, but I can tell you that I have seen Sun machines go nonstop for 10 years without issues. My SPARCserver 1000 from years past had only a couple of bad RAM sticks when I replaced them and fixed it up sometime around 2003 – 2004. So from 1993 – 2003: 2 bad RAM sticks.

When I opened the power supply, nothing appeared to have gone wonky, but it acted otherwise. It was pretty interesting to see a bunch of VRM Errors caused by a bad power supply.

Power Supply opened up after causing multiple VRM faults on an Altix 450. Replacing the power supply with a known good one corrected the faults.

Check out the input / output rails on this Altix Power Supply. 108 Amps!? :-O

Going back to the SGI system above: I did get the Altix 450 to power up with NUMAlink working between two bricks, and I even got Debian to start loading on it. But then, just like on a familiar YouTube channel, “that’s where it all went pear shaped”. The system hung, and I wasn’t getting anything out of the controllers on the brick that had all the node boards in it. After doing a hard shutdown, the L1/L2 controller on the brick never came back up, suffering the same fate that the first brick did in the video above. 🙁 I have video of all of this, and it’s coming very soon.

So what am I going to do? Well, I am going to give the whole project the middle finger and sell the node boards on eBay, since they’re all good. I suspect the issue that I am having is caused by a dead real-time clock battery. Since the date and time were not set correctly on either system, I wonder if that is why stuff no longer powers on. Heck, I may try it later today just to see what happens.

One thing is for sure: I am never going to use SGI systems as a server. I appreciate my V880 in a whole new way. I still have not had any issues with it, and it even recovered from what amounted to some potentially BAD disasters thanks to a few things. The Altix requires the brick you are working on to be powered down when changing out nodes, disks, cards, etc. My V880, which is 5 years older, supports all of that as hotplug, with the exception of the CPU boards I guess. The writer in the blog I mention below puts it best:

Sun’s products really represent a level of innovation and quality that is hard to match. The cost of Sun’s servers and storage products are often said to be too high, but when doing a like for like comparison, Sun products at the same price as those of the competition have better performance, features, power consumption, rack density, upgradability, investment protection, manageability and build quality. I know many of Sun’s products well because of the way we work in the SSA (Sub Saharan African) region – The engineers here all support all of Sun products, bar none.


Sun’s product line includes Storagetek’s tape libraries and VTL / VSM mainframe products. It includes the Fujitsu based M9000-64. It includes the Constellation blades and switches, and little servers like the T2000 and even many smaller, though older V210s and V240s that are still being used. In January I had to replace an EEPROM chip on an Ultra 5 on a scientific vessel in the Cape Town harbor!

Sun also have a major investment in SPARC processors. No, they are not the fastest number crunchers, in fact the UltraSPARC processors are quite slow when compared to the fastest processors from the competition. But scaling to hundreds of CPUs is mature technology in the SPARC camp. Multi-core CPU technology: Mature. Multi-terabyte RAM in a system: Mature technology. NUMA? Mature technology. 64-bit processors? Those came out, what, 12 years ago. Systems with over 700 GB/sec internal bus bandwidth? We got it. Adding memory or CPUs to a running system. Mature, and available for 10 years already. You can even remove those components and repair or replace them without stopping the OS, though it does require that you configure the server correctly.

Since Sun allowed their engineers to get creative, they innovated at an incredible pace, and their products were nothing short of awesome. If you enjoyed the excerpt above, you’ll definitely want to click below.

Someone else’s blog I found-

It’s called “Initial Program Load”. For those who aren’t aware of what the title means, it’s the name for the first step in booting an IBM AS/400 or S/390. It was hours of good reading, when I should have been telling you more about what I’m doing.

Page 2 was really good; so many good stories and guides.


I am selling my stereo, and I am very sad to see it go. It is by far the best set of speakers and amp I have ever had. As long as it is for sale it will be up on Craigslist; you should be able to find it by going to the home page and clicking on the Craigslist postings link.


