<html>
<head>
<title>SICS Trouble Shooting</title>
</head>
<body>
<h1>SICS Trouble Shooting </h1>
<hr size=4 width="66%">
<p>
There is no such thing as bug-free software: there are always bugs, nasty
behaviour and the like. This document is meant to help solve such problems.
The usual symptoms are that a client cannot connect to the server, that the
server does not respond, or that error messages show up.
</p>
<h2>Looking at Log Files</h2>
<p>
The first thing to do, especially when confronted with confusing statements
from either users or instrument scientists, is to look at the SICS server's
log files. The last 1000 lines of the instrument log are accessible from
any SICS client or through the WWW interface. The SICS commands:
<dl>
<dt>commandlog tail
<dd> shows the last 20 lines of the log.
<dt>commandlog tail n
<dd>shows the last n lines of the log.
</dl>
will show you the information available. In order to see more, log in to the
instrument account. There the following unix commands might help:
<ul>
<li><b>sicstail</b> shows the last 20 lines of the current log file and its
name
<li><b>sicstail n</b> shows the last n lines of the current log file.
</ul>
To see even more, cd into the log directory of the instrument
account. It contains files with names like:
<pre>
auto2001-08-08@00-01-01.log
</pre>
This means that the log file was started on August 8, 2001 at 00:01:01.
A new log file is started daily. Load the appropriate files into an editor
and look at what really happened.
</p>
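<p>
As a small illustration of this naming scheme, the following shell sketch
turns a log file name into the start time it encodes. It assumes that all
log files follow exactly the pattern shown above. The function name
<tt>logstart</tt> is made up for this example and is not part of SICS.
</p>
<pre>
# Hypothetical helper: print the start time encoded in a log file name.
# Assumes names of the form autoYYYY-MM-DD@hh-mm-ss.log.
logstart() {
    echo "$1" | sed -e 's/^auto\([0-9-]*\)@\([0-9]*\)-\([0-9]*\)-\([0-9]*\)\.log$/\1 \2:\3:\4/'
}

logstart 'auto2001-08-08@00-01-01.log'
</pre>
<p>
For the file name above this prints 2001-08-08 00:01:01.
</p>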
<p>
The log files show you all commands given and all the responses of the system.
In addition, there are hourly time stamps in the file which help to narrow
down when the problem started. Things to watch out for are:
<dl>
<dt>MOTOR ALARM
<dd>This message means that the motor failed to reach its position several
times. This is caused by either a concrete shielding element
blocking the movement of the instrument, badly adjusted motor parameters,
mechanical failures or the air cushions not operating properly.
<dt>EL734__BAD_EMERG_STOP
<dd>Somebody has pushed the emergency stop button. This must be released
before the instrument can move again. Moreover the motor controller will
not respond to further commands in this mode. Thus restarting SICS on this
error message will make SICS fail to initialize the motors affected!
<dt>EL***__BAD_PIPE, BAD_RECV, BAD_ILLG, BAD_TMO, BAD_SEND
<dd>Network communication problems. These can generally be solved by restarting
SICS.
<dt>EL737__BAD_BSY
<dd>A counting operation was aborted while the beam was off. Unfortunately,
the counter box does not respond to commands in this state and ignores the
stop command sent to it during the abort operation. This can be resolved by
the command:
<pre>
counter stop
</pre>
when the beam is on again.
</dl>
</p>
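<p>
To find such messages quickly, the log files can be searched with grep. The
sketch below assumes that the messages appear in the log literally as named
in the list above. The function name <tt>scanlog</tt> is invented for this
example.
</p>
<pre>
# Hypothetical helper: list log lines containing one of the messages above.
# Reads the files given as arguments, or standard input if there are none.
scanlog() {
    grep -n -E 'MOTOR ALARM|BAD_EMERG_STOP|BAD_(PIPE|RECV|ILLG|TMO|SEND)|BAD_BSY' "$@"
}
</pre>
<p>
In the log directory this might be used as <tt>scanlog *.log</tt>. Each match
is printed with its line number, and with the file name when several files
are searched.
</p>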
<h2>Starting SICS</h2>
<p>
An essential prerequisite of SICS is that the server is up
and running. The system is configured to restart the SICServer whenever it
fails. The SICServer must be restarted manually only after a reboot or when
the keepalive processes have been killed (see below). This is done for all
instruments by
typing:
<pre>
startsics
</pre>
at the command prompt. startsics actually starts two programs: one is
the replicator application which is responsible for the automatic
copying of data files to the laboratory server. The other is the SICS
server. Both programs are started by means of a shell script called
<b>keepalive</b>. keepalive is basically an endless loop which calls
the program again and again and thus ensures that the program will
never stop running.
</p>
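<p>
The idea behind keepalive can be sketched in a few lines of shell. This is
only an illustration of the endless-loop principle, not the actual keepalive
script. The function name and the variables KEEPALIVE_MAX and KEEPALIVE_DELAY
are assumptions made for this example.
</p>
<pre>
# Illustrative sketch only, NOT the real keepalive script.
# Runs the given command again each time it exits.
# KEEPALIVE_MAX   (hypothetical) stop after this many runs; 0 means forever
# KEEPALIVE_DELAY (hypothetical) seconds to wait between runs
keepalive_sketch() {
    runs=0
    while [ "${KEEPALIVE_MAX:-0}" -eq 0 ] || [ "$runs" -lt "${KEEPALIVE_MAX:-0}" ]; do
        "$@"
        runs=$((runs + 1))
        sleep "${KEEPALIVE_DELAY:-1}"
    done
}
</pre>
<p>
With both variables unset, <tt>keepalive_sketch SICServer inst.tcl</tt> would
restart the program forever, pausing one second between runs.
</p>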
<p>
When the SICS server hangs, or you want to enforce a reinitialization of
everything, the server process must be killed. This can be accomplished either manually or through a shell script.
</p>
<h2>Stopping SICS</h2>
<p>
All SICS processes can be stopped through the command:
<pre>
killsics
</pre>
given at the unix command line. You must be the instrument user
(for example DMC) on the instrument computer for this to work properly.
</p>
<h2>Restart Everything</h2>
<p>
If nothing seems to work any more, no connections can be obtained etc., then
the next guess is to restart everything. This is especially necessary if
mechanics or electronics people were closer to the instrument than 400 meters.
<OL>
<LI> Reboot the histogram memory. It has a tiny button labelled RST. That's
the one. Can be operated with a hairpin, a ball point pen or the like.
<LI> Wait 5 minutes.
<LI> Restart the SICServer. Watch for any messages about things not being
connected or configured.
<LI> Restart and reconnect the client programs.
</OL>
If this fails (even after a second time), there may be a network problem which
cannot be resolved by simple means.
</p>
<h2>Checking SICS Startup</h2>
<p>
Sometimes the SICServer hangs while starting up, or hardware
components are not properly initialized. In such cases it is useful to
look at the SICS server's startup messages. In order to do so, both the
SICServer and its keepalive process must be killed first. On the instrument
account issue the command:
<pre>
ps -A | grep SICS
</pre>
A message like this will be printed:
<pre>
23644 ?? I 0:00.00 ksh keepalive SICServer focus.tcl
23672 ?? R 59:24.05 SICServer focus.tcl
7119 ttyp6 S + 0:00.00 grep SICS
</pre>
Remember the numbers in the first column (the PIDs) and kill both
programs by issuing the command:
<pre>
kill -9 pid pid
</pre>
Example:
<pre>
kill -9 23644 23672
</pre>
Note that the numbers are those displayed by the ps -A command.
Then cd into the bin directory of the instrument account and issue
the unix command:
<pre>
SICServer inst.tcl | more
</pre>
Replace inst.tcl with the name of the appropriate instrument initialisation
file. This allows you to page through the SICS startup messages and will help
to identify the troublesome component. Then proceed to check the component and
the connections to it.
</p>
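<p>
Picking the PIDs out of the ps listing can also be left to awk. The sketch
below assumes ps output in the format shown above, with the PID in the first
column and the full command line at the end. The name <tt>sicspids</tt> is
invented for this example. It only prints the PIDs, so that they can be
checked before anything is killed.
</p>
<pre>
# Hypothetical helper: print the PIDs of all lines mentioning SICServer.
# The pattern [S]ICServer keeps awk from matching its own command line.
sicspids() {
    awk '/[S]ICServer/ { print $1 }'
}

ps -A | sicspids
</pre>
<p>
For the listing above this would print 23644 and 23672, one per line, and
skip the grep process.
</p>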
<h2>Getting New SICS Software</h2>
<p>
Sometimes you might want to be sure that you have the latest SICS software.
This is how to get it:
<ol>
<li>Login to the instrument account.
<li>If you are not there already, type cd to get into the home directory.
<li>Type <b>killsics</b> at the unix prompt in order to stop the SICS server.
<li>Type <b>sicsinstall exe</b> at the unix prompt for copying new
SICS software from the general distribution area.
<li>Type <b> startsics</b> to restart the SICS software.
</ol>
</p>
<h2>Hot Fixes</h2>
<p>
When there is trouble with SICS you may be asked by one of the SICS
programmers to copy the most recent development version of the SICS server
to your machine. This is done as follows:
<ol>
<li>Login to the instrument account.
<li>cd into the bin directory, for example: /home/DMC/bin.
<li>Type <b> killsics</b> at the unix prompt in order to stop the SICS server.
<li>Type <b>cp /data/koenneck/src/sics/SICServer .</b> at the unix prompt.
<li>Type <b> startsics</b> to restart the SICS software.
</ol>
<b>!!!!!! WARNING !!!!!!! Do this only when advised to do so by a competent
SICS programmer. Otherwise you might be copying a SICS server in an
unstable experimental state!</b>
</p>
<h2> HELP debugging!!!!</h2>
<p>
The SICS server should not hang or crash. In order to sort such
problems out it is very helpful if any available debugging information is
saved and presented to the programmers. The information available consists of
the log files, as written continuously by the SICS server, and possible core
files lying around. They have just this name: core. In order to save them,
create a new directory (for example dump2077) and copy the files into it. This
looks like:
<pre>
/home/DMC> mkdir dump2077
/home/DMC> cp log/*.log dump2077
/home/DMC> cp core dump2077
</pre>
The <tt>/home/DMC> </tt> is just the command prompt. Please note that core
files are only available after crashes of the server. These few commands
will help to analyse the cause of the problem and eventually to resolve it.
</p>
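<p>
The same steps can be collected in a small shell function. This is a sketch
under the assumption that the log files are in the log directory and that a
core file, if any, is in the home directory of the instrument account, as in
the example above. The function name <tt>savedebug</tt> is made up.
</p>
<pre>
# Hypothetical helper: copy log files and a core file, if present,
# from the instrument home directory $1 into the dump directory $2.
savedebug() {
    mkdir -p "$2"
    cp "$1"/log/*.log "$2"
    if [ -f "$1/core" ]; then
        cp "$1/core" "$2"
    fi
}
</pre>
<p>
It might be called as <tt>savedebug /home/DMC /home/DMC/dump2077</tt>.
</p>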
</body>
</html>