Diagnosing random crashes
Last revised: Aug 9, 2003
Solaris will exercise everything on a PC. I have seen machines run just fine with Windows for months and crash during a Solaris x86 install. The reason was always marginal parts/specs/timings/temperatures that were never pushed to the failure limits until Solaris x86 was installed.
Possible causes of random system panic
First: verify that your CPU fan is running and that there are no loose fan cables etc. that might come into contact with the CPU fan and slow it down or stop it.
Next: it is very likely that you have poor/marginal memory or other components. There are three possibilities:
1: Testing RAM
Exercise all your RAM as follows:
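Find a large file somewhere on your system and copy it to RAM by copying it to /tmp (on Solaris, /tmp is a tmpfs filesystem backed by RAM and swap). Start by listing a directory such as /usr/lib with its files sorted by size, smallest first.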
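One way to produce such a listing is sketched below. It assumes a POSIX shell (ksh or bash) and that the fifth column of ls -l output is the file size in bytes; the function name largest_last is made up for this illustration.

```shell
# List the regular files in a directory sorted by size, largest last.
# Assumption: column 5 of `ls -l` is the size in bytes.
# largest_last is a hypothetical helper name, not a system command.
largest_last() {
    dir=${1:-/usr/lib}
    # grep '^-' keeps regular files only (drops the "total" line,
    # directories and symlinks); sort -n -k 5 orders by the size column.
    ls -l "$dir" | grep '^-' | sort -n -k 5
}

# Example: show the five largest files in /usr/lib
#   largest_last /usr/lib | tail -5
```

The last line of the output names the largest file in the directory.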
Now pick the largest file, which will be at the bottom of the listing. Copy the file to RAM and make copies of it to force the machine to use all available RAM before using swap space.
For example, assume that the file called libCstd.so.1 is the largest file in /usr/lib. First run a command such as swap -l to see how much swap you have defined and how much is free.
Let us assume your machine has 1GB of swap free and 512MB of RAM. You need to create a scenario where the system will gobble up all RAM and get seriously into your available swap space. But you don't want to use all swap space, or you may create problems maintaining control of your box. So, again taking totally bogus numbers, assume you have 512MB of RAM and 1GB of swap free and that your big file from /usr/lib is 2MB in size. We will go for a target of approx 1.2GB of RAM + swap to consume, out of the roughly 1.5GB available, in order to ensure that we are testing/using all your available RAM locations:
cp /usr/lib/libCstd.so.1 /tmp/t
You can always run swap -l between file copies to see how much swap is being consumed. Stop before you run low on swap space.
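The repeated copying can be sketched as a small shell function. This is only an illustration under stated assumptions: a POSIX shell with arithmetic expansion (ksh or bash, not the old Bourne /bin/sh), and a made-up helper name fill_tmp that writes copies named t.0, t.1, and so on. It does not check swap for you, so keep an eye on swap -l from another terminal while it runs.

```shell
# Copy SRC into DIR repeatedly until about TARGET_KB kilobytes are consumed.
# fill_tmp and the t.N file names are hypothetical, chosen for illustration.
fill_tmp() {
    src=$1
    dir=$2
    target_kb=$3
    # Size of the source file in KB, rounded up (wc -c reports bytes).
    size_kb=$(( ( $(wc -c < "$src") + 1023 ) / 1024 ))
    [ "$size_kb" -gt 0 ] || return 1      # refuse to loop on an empty file
    used_kb=0
    n=0
    while [ "$used_kb" -lt "$target_kb" ]; do
        cp "$src" "$dir/t.$n" || return 1 # stop if a copy fails (e.g. /tmp full)
        used_kb=$(( used_kb + size_kb ))
        n=$(( n + 1 ))
    done
    echo "$n copies, about $used_kb KB"
}

# Example with the bogus numbers above (2MB file, approx 1.2GB target):
#   fill_tmp /usr/lib/libCstd.so.1 /tmp 1200000
```

With a 2MB file and a 1.2GB target this comes to roughly 600 copies. Files created this way would be removed with rm /tmp/t.* afterwards.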
You have just exercised all available RAM locations. If your machine has not crashed, then you have probably proved that all your RAM is good. Now, don't forget to clean up:
rm /tmp/t /tmp/t1
and run the same test again.
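As an optional extra check that is not part of the procedure above, the copies can be compared byte for byte against the source file before they are removed, since bad RAM can corrupt data without crashing the machine. The sketch below assumes a POSIX shell and that each file in /tmp is a straight copy of the original; verify_copies is a made-up helper name.

```shell
# Compare each copy against the original and report any that differ.
# verify_copies is a hypothetical helper, not a system command.
# Usage: verify_copies SRC COPY [COPY...]
verify_copies() {
    src=$1
    shift
    bad=0
    for f in "$@"; do
        # cmp -s is silent and returns non-zero when the files differ.
        if ! cmp -s "$src" "$f"; then
            echo "MISMATCH: $f"
            bad=$(( bad + 1 ))
        fi
    done
    echo "$bad mismatched"
    [ "$bad" -eq 0 ]   # exit status 0 only if every copy matched
}

# Example:
#   verify_copies /usr/lib/libCstd.so.1 /tmp/t /tmp/t1
```

A mismatch here points at a hardware problem somewhere between RAM, swap and disk, even if the box stayed up.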
If your machine has survived this test, then check out theory #2.
2: Possible component overheating
The easiest way to test this theory is to take the covers off the machine and point the biggest window fan you can find into the machine (run it at maximum speed). This will keep everything in the box cool. Also verify that the hard drives have some airflow across the HDA (Head Disk Assembly). The drive does not care whether the airflow is left to right, right to left, front to back or back to front, as long as there is some airflow across the HDA. This is important because even a single byte error in swap space (due to an overheated disk) will eventually crash your OS. And maybe not immediately: it could be several hours or days before the OS retrieves that particular corrupted swapped data, and even then it may simply corrupt a system binary, which might not immediately crash the machine.
If you don't have airflow across the disk, jury-rig any 12 volt PC fan to blow air across the drive for the purpose of testing this theory. Later, follow up with improved cooling for the drive to ensure that the drive will outlive its warranty period!
I often run test #1 above after I build/load a new machine, as a quick validation that the machine does not have some infrequently accessed bad RAM.
Another possible approach is to remove RAM SIMMs one at a time and try to isolate the failure to one part. If you make it to this point, though, it probably makes more sense to use a rigorous memory test program to weed out marginal RAM parts. Run these test programs overnight.
Copyright © 2003 by Al Hopper. All rights reserved.