sics/doc/manager/trouble.htm
2012-11-15 12:39:51 +11:00


<html>
<head>
<title>SICS Trouble Shooting</title>
</head>
<body>
<h1>SICS Trouble Shooting </h1>
<hr size=4 width="66%">
<H2>Check Server Status</h2>
<p>
One of the first things to do is to check the server status with:
<b>monit status</b>
</p>
<hr size=4 width="66%">
<H2>Inspecting Log Files</h2>
<p>
Suppose something went wrong over the weekend or during the night and
you are not absolutely sure what the problem was. In such a case it is
helpful to look at the SICS log files. They live in the log directory
of the instrument account. For each day (or after each restart of the
SICS server) a new log file is created. They are named according to the
following convention:
<pre>
autoYYYY-mm-dd@hh-MM-ss.log
</pre>
with YYYY denoting the year, mm the month, dd the day, hh the hour,
MM the minute and ss the second of creation. The most recent log file
can be viewed with the <b>sicstail</b> command: <b>sicstail num</b>
shows the last num lines of the log file. Within SICS, and especially in
the SICS command line client, the last 1000 lines of the log are
accessible through the <b>commandlog tail num</b> command. The command
log is also accessible through the WWW at lns00. The log file carries
hourly time stamps, which make it possible to find out exactly when a
problem first appeared.
</p>
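<p>
The naming convention above can also be used to pick out the most recent
log file programmatically. The following is a minimal sketch in Python,
not part of SICS: the function names <tt>log_start_time</tt> and
<tt>newest_log</tt> are made up for this illustration, and the only thing
taken from the manual is the file name pattern itself.
</p>

```python
import re
from datetime import datetime

# File name pattern from the convention above: autoYYYY-mm-dd@hh-MM-ss.log
LOG_NAME = re.compile(r"^auto(\d{4})-(\d{2})-(\d{2})@(\d{2})-(\d{2})-(\d{2})\.log$")


def log_start_time(filename):
    """Return the creation time encoded in a log file name, or None.

    Hypothetical helper: names that do not follow the convention
    (for example monit.log) yield None.
    """
    match = LOG_NAME.match(filename)
    if match is None:
        return None
    try:
        return datetime(*(int(group) for group in match.groups()))
    except ValueError:
        # Digits were present but do not form a valid date or time.
        return None


def newest_log(filenames):
    """Return the name with the latest encoded creation time, or None."""
    dated = [(log_start_time(name), name) for name in filenames]
    dated = [pair for pair in dated if pair[0] is not None]
    if not dated:
        return None
    return max(dated)[1]
```

<p>
A typical call would pass the output of <tt>os.listdir("log")</tt> to
<tt>newest_log</tt>. Sorting by the time stamp in the name rather than by
file modification time keeps the result stable if an old log file is
touched later.
</p>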
<p>
There is a second log file, log/monit.log, which records messages from
the monit daemon. It can be used to determine when server processes
were restarted or when hardware failed.
</p>
<p>
Quite often, inspection of the log files will point to problems
which are not software related, such as:
<ul>
<li>Communication problems (usually network).
<li>Positioning problems of motors.
<li>BAD_EMERG_STOP: the motor emergency stop was engaged. It must be
released before the motors can move again.
<li>BAD_STP: a motor has been switched off.
</ul>
</p>
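<p>
When a log file is long, counting how often the markers listed above occur
gives a quick first impression. The sketch below is in Python and is an
illustration only: <tt>count_markers</tt> is a hypothetical helper, and it
assumes that the markers BAD_EMERG_STOP and BAD_STP appear literally in the
log lines.
</p>

```python
from collections import Counter

# Markers named in the list above; extend the tuple for other messages.
MARKERS = ("BAD_EMERG_STOP", "BAD_STP")


def count_markers(lines, markers=MARKERS):
    """Count the lines that contain each marker.

    `lines` is any iterable of strings, for example an open log file.
    Assumption: the markers occur verbatim in the log text.
    """
    counts = Counter()
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts
```

<p>
To apply it, open the log file in text mode, pass the file object to
<tt>count_markers</tt> and print the result. A marker that never occurs
is simply reported with a count of zero when looked up.
</p>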
<hr size=4 width="66%">
<h2>Restarting SICS</h2>
<p>
<dl>
<dt>To restart the SICS server, type: <b>monit restart sicsserver</b>
</dl>
</p>
<hr size=4 width="66%">
<h2>Restart Everything</h2>
<p>
If nothing seems to work any more, no connections can be obtained, etc., then
the next step is to restart everything. This is especially necessary if
mechanics or electronics people have been closer to the instrument than a
nautical mile.
<uL>
<LI>Reboot the histogram memory. It has a tiny button labelled RST. That's
the one. It can be operated with a hairpin, a ballpoint pen or the like.
<li>Restart all of SICS with the sequence: <b>monit stop all; monit quit; monit</b>
<li>Wait a couple of minutes for the system to come up.
</ul>
</p>
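<p>
If the software part of this procedure has to be run from a script, the
command sequence can be wrapped as in the following Python sketch. It is
an assumption-laden illustration: <tt>restart_everything</tt> is a
hypothetical name, the three commands are exactly those quoted above, and
the script is assumed to run in the instrument account with monit on the
search path. Pressing the RST button and waiting for the system to come up
remain manual steps.
</p>

```python
import subprocess

# The sequence quoted above, one argument list per command.
RESTART_SEQUENCE = (
    ["monit", "stop", "all"],
    ["monit", "quit"],
    ["monit"],
)


def restart_everything(run=subprocess.run):
    """Run the monit commands in order.

    With the default runner, check=True raises CalledProcessError at the
    first command that exits with a non-zero status, so later commands
    are not run after a failure.
    """
    for command in RESTART_SEQUENCE:
        run(command, check=True)
```

<p>
The <tt>run</tt> parameter exists only so that the sequence can be tried
out with a stand-in function before it is let loose on a real instrument.
</p>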
<hr size=4 width="66%">
<h2>Starting SICS Manually</h2>
<p>
In order to find out whether some hardware is broken or whether the SICS
server initializes badly, it is useful to look at the SICS server's startup
messages. The following steps are required:
<ul>
<li>monit stop sicsserver
<li>cd ~/inst_sics
<li>./SICServer inst.tcl | more
</ul>
Replace inst by the name of the instrument, as usual. Look at the screen
output in order to find out why SICS does not initialize things or where the
initialization hangs. Do not forget to kill the SICServer thus started when
you are done, and to issue the command <b>monit start sicsserver</b> in order
to place the SICS server back under monit's control.
</p>
<hr size=4 width="66%">
<h2>Test the SerPortServer Program</h2>
<p>
Sometimes the SerPortServer program hangs and blocks communication with
the RS-232 hardware. This can be diagnosed by the following procedure: Find
out at which port either an EL734 motor controller or an E737 counter box
lives. Then type: <b>asyncom localhost 4000 portnumber</b>. This yields a
new prompt at which you type <b>ID</b>. If all is well, a string identifying
the device will be printed. If not, a large stack dump will come up.
The asyncom program can be exited by typing <b>quit</b>. If there is
a problem with the SerPortServer program, type
<b>monit restart SerPortServer</b> in order to restart it.
</p>
<hr size=4 width="66%">
<h2>Trouble with Environment Devices</h2>
<p>
The first stop for trouble with temperature or other environment devices
is Markus Zolliker. A common problem is that old environment controllers
have not been deconfigured from the system and still reserve terminal server
ports. Therefore, take care to deconfigure your old devices when swapping.
</p>
<hr size=4 width="66%">
<h2> HELP debugging!!!!</h2>
<p>
The SICS server should never hang or crash. In order to sort such
problems out, it is very helpful if any available debugging information is
saved and presented to the programmers. The available information consists
of the log files, as written continuously by the SICS server, and possibly
core files lying around. Core files have just this name: core. In order to
save them, create a new directory (for example dump2077) and copy everything
into it. This looks like:
<pre>
/home/DMC> mkdir dump2077
/home/DMC> cp log/*.log dump2077
/home/DMC> cp core dump2077
</pre>
The <tt>/home/DMC> </tt> is just the command prompt. Please note that core
files are only available after crashes of the server. Saving this information
will help to analyse the cause of the problem and eventually to resolve it.
</p>
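<p>
The same collection step can be sketched in Python for use in a script.
This is an illustration under stated assumptions, not a SICS tool:
<tt>save_debug_dump</tt> is a hypothetical name, and the layout (a
<tt>log</tt> directory and an optional <tt>core</tt> file directly in the
instrument account) is the one shown in the commands above.
</p>

```python
import shutil
from pathlib import Path


def save_debug_dump(account_dir, dump_name):
    """Copy log/*.log and the core file, if any, into a new dump directory.

    Returns the list of copied file names. Raises FileExistsError if the
    dump directory already exists, so an earlier dump is never overwritten.
    """
    account = Path(account_dir)
    dump = account / dump_name
    dump.mkdir(exist_ok=False)

    copied = []
    for log_file in sorted((account / "log").glob("*.log")):
        shutil.copy2(log_file, dump / log_file.name)
        copied.append(log_file.name)

    # A core file only exists after a crash of the server.
    core = account / "core"
    if core.is_file():
        shutil.copy2(core, dump / "core")
        copied.append("core")
    return copied
```

<p>
For the example above, the call would be
<tt>save_debug_dump("/home/DMC", "dump2077")</tt>. Using
<tt>shutil.copy2</tt> preserves the modification times of the files, which
can help when reconstructing the order of events.
</p>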
</body>
</html>