Diagnosing random crashes

Al Hopper

Last revised: Aug 9, 2003

Solaris will exercise everything on a PC. I have seen machines run just fine with Windows for months and crash during a Solaris x86 install. The reason was always marginal parts/specs/timings/temperatures that were never pushed to the failure limits until Solaris x86 was installed.

Possible causes of random system panic

First: verify that your CPU fan is running and that there are no loose fan cables etc. that might come into contact with the CPU fan and slow it down or stop it.

Next: it is very likely that you have poor/marginal memory or other components. There are three possibilities:

  1. You have a bad memory location that does not normally get used until the machine starts to run low on RAM. The time it takes for the system to reach this area of RAM will vary, depending on what you are using the machine for and how busy it is.
  2. Memory timing is marginal. When the system heats up, that marginal memory timing will slip below a threshold that will cause the machine to fail. Your support chipset (or something else) may be running too hot and failing when it heats up. (Does your support chipset have a heatsink attached?)
  3. Your power supply (PSU) is marginal. Ensure that you are running a quality PSU that can supply more than the rated power requirements of your motherboard and associated peripherals, including hard drives and PCI plug-in cards. For the 800MHz (front-side bus) capable P4 parts, you need to pay particular attention to the 12 Volt rating of the power supply. A marginal PSU will often run out of steam when it heats up. I would suggest you check out the Zalman ZM400A-APF, recently reviewed favorably at tomshardware.com and available online for approx $80 at www.directron.com (usual disclaimers... just a happy customer). There are too many junky PSUs being shipped these days.

Solutions

1: Testing RAM

Exercise all your RAM as follows:

Find a large file somewhere on your system and copy it to RAM by copying it to /tmp (on Solaris, /tmp is normally a memory-based file system backed by swap). Do something like:

cd /usr/lib
ls -al | sort -k 5n

Now pick the largest file, which will be at the bottom of the listing. Copy the file to RAM and make copies of it to force the machine to use all available RAM before using swap space.
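If you would rather not read the listing by eye, the same selection can be sketched as a small shell function. This is only an illustration: largest_file is a made-up helper name, it assumes a POSIX shell (ksh or bash rather than the legacy Bourne shell) and a POSIX awk, and it assumes file names without whitespace.

```shell
# largest_file DIR: print the path of the largest regular file in DIR.
# Hypothetical helper; assumes file names contain no whitespace.
largest_file() {
    dir=$1
    # In `ls -l` output, field 5 is the size in bytes and the last field
    # is the name. Lines starting with "-" are regular files, so
    # directories and symbolic links are skipped.
    name=$(ls -l "$dir" |
        awk '/^-/ && $5 + 0 >= max { max = $5 + 0; name = $NF }
             END { if (name != "") print name }')
    # Print nothing (and return non-zero) if DIR has no regular files.
    [ -n "$name" ] && echo "$dir/$name"
}
```

For example, largest_file /usr/lib prints the path you would then copy to /tmp/t.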

For example, assume that the file called libCstd.so.1 is the largest file in /usr/lib. First do:

swap -l

and see how much swap you have defined and how much is free. Let us assume your machine has 1GB of swap free and 512MB of RAM. You need to create a scenario where the system will gobble up all RAM and get seriously into your available swap space. But you don't want to use all swap space, or you may create problems maintaining control of your box. So, again taking totally bogus numbers, assume you have 512MB of RAM and 1GB of swap free and that your big file from /usr/lib is 2MB in size. We will go for a target of approx 1.2GB of RAM + swap to consume in order to ensure that we are testing/using all your available RAM locations:

cp /usr/lib/libCstd.so.1 /tmp/t
cd /tmp
cat t t t t t > t1
# so now file t1 is approx 10MB (5 * 2MB = 10MB)
mv t1 t
cat t t t t t > t1
# so now file t1 is approx 50MB
mv t1 t
cat t t t t t > t1
# so now file t1 is approx 250MB
mv t1 t
cat t t t t > t1
# only four copies this time: t1 is approx 1GB and /tmp/t is 250MB,
# so you are using approx 1.25GB of RAM + swap space, close to the
# 1.2GB target and short of the 1.5GB total.

You can run swap -l between any of these file copies to see how much swap is being consumed. Stop before you run low on swap space.
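The "check before you copy" habit can be sketched in shell as well. The two functions below are hypothetical helpers, not part of Solaris. They assume a POSIX shell and the classic swap -l layout: one header line, then one line per swap device with the free count, in 512-byte blocks, in the fifth column. Check that against the output on your own machine before relying on it.

```shell
# swap_free_mb: read `swap -l` output on stdin and print free swap in MB.
# Assumes a header line followed by one line per device, with the free
# space in 512-byte blocks in column 5.
swap_free_mb() {
    awk 'NR > 1 { free += $5 } END { print int(free * 512 / 1048576) }'
}

# can_grow NEED_MB RESERVE_MB: read `swap -l` output on stdin and succeed
# only if writing NEED_MB more would still leave RESERVE_MB of swap free.
# It treats the whole request as if it landed in swap, which is
# deliberately pessimistic because some of it will fit in RAM.
can_grow() {
    free=$(swap_free_mb)
    [ $((free - $1)) -ge "$2" ]
}
```

A possible use, with made-up numbers, is swap -l | can_grow 1000 200 && cat t t t t > t1, which only starts a 1GB copy if about 200MB of swap would remain afterwards.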

You have just exercised all available RAM locations. If your machine has not crashed, then you have probably shown that all your RAM is good. Now, don't forget to clean up:

rm /tmp/t /tmp/t1

and run the same test again.
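The test above only catches faults that crash the machine. As an optional extra that goes beyond the procedure described here, the sketch below makes a number of copies in a directory such as /tmp and then compares each one byte for byte with the original, so a silently corrupted copy would also show up. verify_copies is a made-up name, and a POSIX shell is assumed.

```shell
# verify_copies SRC DIR COUNT: make COUNT copies of SRC in DIR, then
# compare each copy byte for byte with SRC. Prints the number of copies
# that differ and returns non-zero if any do. Hypothetical helper.
verify_copies() {
    src=$1 dir=$2 count=$3
    bad=0
    # Make all the copies first so they occupy memory at the same time.
    i=1
    while [ "$i" -le "$count" ]; do
        cp "$src" "$dir/copy.$i" || return 2
        i=$((i + 1))
    done
    # Then compare every copy with the original and remove it.
    i=1
    while [ "$i" -le "$count" ]; do
        cmp -s "$src" "$dir/copy.$i" || bad=$((bad + 1))
        rm -f "$dir/copy.$i"
        i=$((i + 1))
    done
    echo "$bad"
    [ "$bad" -eq 0 ]
}
```

For example, verify_copies /usr/lib/libCstd.so.1 /tmp 100 would hold about 200MB of copies with a 2MB file; size the count against your RAM and swap exactly as in the manual test.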

If your machine has survived this test, then check out theory #2.

2: Possible component overheating

The easiest way to test this is to take the covers off the machine and point the biggest window fan you can find into it, running at maximum speed. This will keep everything in the box cool. Also verify that the hard drives have some airflow across the HDA (Head Disk Assembly). The drive does not care whether the airflow is left to right, right to left, front to back or back to front, as long as there is some airflow across the HDA. This is important because even a single-byte error in swap space (due to an overheated disk) will eventually crash your OS. The crash may not be immediate: it could be several hours or days before the OS retrieves that particular corrupted swapped data, and even then it may simply corrupt a system binary, which might not immediately crash the machine.

If you don't have airflow across the disk, jury-rig any 12 volt PC fan to blow air across the drive for the purpose of testing this theory. Later, follow up with improved cooling for the drive to ensure that it will outlive its warranty period!

I often run test #1 above after I build and load a new machine, as a quick validation that I don't have some infrequently accessed bad RAM in the machine.

Another possible approach is to remove the RAM SIMMs one at a time and try to isolate the failure to one part. If you make it to this point, though, it probably makes more sense to use a rigorous memory test program to weed out marginal RAM parts. Run these test programs overnight.

Copyright © 2003 by Al Hopper. All rights reserved.

Last modified: 2003-08-10