sics/doc/manager/trouble.htm
cvs db6c355f44 - Enhanced and debugged histogram memory for AMOR
* added PROJECT both in HM and driver code
  * added single detector support.
- Removed several bugs in the AMOR data bit.
- Updated documentation
2001-08-17 14:33:05 +00:00


<html>
<head>
<title>SICS Trouble Shooting</title>
</head>
<body>
<h1>SICS Trouble Shooting </h1>
<hr size=4 width="66%">
<H2>Inspecting Log Files</h2>
<p>
Suppose something went wrong over the weekend or during the night and
you are not absolutely sure what the problem was. In such a case it is
helpful to look at the SICS log files. They live in the log directory
of the instrument account. For each day (or after each restart of the
SICS server) a new log file is created. They are named according to the
following convention:
<pre>
autoYYYY-mm-dd@hh-MM-ss.log
</pre>
with YYYY denoting the year, mm the month, dd the day, hh the hour of
creation, MM the minute of creation and ss the seconds of
creation. The most recent log file can be looked at with the
<b>sicstail</b> command. <b>sicstail num</b> shows the last num lines
of the log file. Within SICS and especially in the SICS command line
client, the last 1000 lines of the log are accessible through the
<b>commandlog tail num</b> command. The command log is also accessible
through the WWW at lns00. The log file carries hourly time
stamps, which make it possible to find out exactly when a problem began to
appear.
</p>
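<p>
The naming convention has a useful side effect: if the time fields are
zero-padded, as the pattern suggests, then sorting the file names
alphabetically also sorts them by creation time. The following shell sketch
relies on that assumption to pick the newest log file in a directory and show
its last lines. The function names <tt>latest_sics_log</tt> and
<tt>tail_sics_log</tt> are made up for this example, and this is not
necessarily how <b>sicstail</b> works internally.
</p>

```shell
# Print the path of the newest log file in directory $1.
# Assumes names follow autoYYYY-mm-dd@hh-MM-ss.log with zero-padded fields,
# so that alphabetical order equals chronological order.
latest_sics_log() {
    for f in "$1"/auto*.log; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort | tail -n 1
}

# Show the last $2 lines (default 20) of the newest log file in directory $1.
tail_sics_log() {
    newest=$(latest_sics_log "$1")
    [ -n "$newest" ] || { echo "no log files in $1" >&2; return 1; }
    tail -n "${2:-20}" "$newest"
}
```

<p>
For the DMC account used as an example in this document, a call might look
like <tt>tail_sics_log /home/DMC/log 50</tt>.
</p>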
<p>
Quite often the inspection of the log files will indicate problems
which are not software related such as:
<ul>
<li>Communication problems (usually network)
<li>Positioning problems of motors.
<li>BAD_EMERG_STOP: the motor emergency stop was engaged. It must be
released before the motors move again.
<li>BAD_STP: a motor has been switched off.
</ul>
</p>
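<p>
A quick way to check a log file for the two motor codes listed above is to
search for them. This is a minimal sketch which assumes that the codes appear
literally in the log text; the function name <tt>find_motor_errors</tt> is
invented for this example.
</p>

```shell
# Print every line (with its line number) of log file $1 that mentions
# one of the hardware-related error codes BAD_EMERG_STOP or BAD_STP.
# Assumes the codes appear literally in the log text.
find_motor_errors() {
    grep -n -E 'BAD_EMERG_STOP|BAD_STP' "$1"
}
```

<p>
The function returns a non-zero exit status when neither code is found, so it
can also be used in a shell <tt>if</tt> statement.
</p>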
<h2>Restarting SICS</h2>
<hr size=4 width="66%">
<p>
There is no such thing as bug-free software. There are always bugs, nasty
behaviour etc. This document is meant to help solve such problems. The usual
symptom is that a client cannot connect to the server or that the server is
not responding.
</p>
<p>
An essential prerequisite of SICS is that the servers are up
and running. The system is configured to restart the SICServer whenever it
fails. The SICServer must be restarted by hand only after a reboot or when the
keepalive processes have been killed (see below). This is done for all
instruments by typing:
<pre>
startsics
</pre>
at the command prompt. startsics actually starts several programs, see
the Setup section for details. All programs are started by means of a
shell script called
<b>keepalive</b>. keepalive is basically an endless loop which calls
the program again and again and thus ensures that the program will
never stop running.
</p>
<p>
When the SICS server hangs, or you want to enforce a reinitialization of
everything, the server process must be killed. This can be accomplished either manually or through a shell script.
</p>
<h2>Stopping SICS</h2>
<p>
All SICS processes can be stopped through the command:
<pre>
killsics
</pre>
given at the unix command line. You must be the instrument user
(for example DMC) on the instrument computer for this to work properly.
</p>
<h2>Finding the SICS server</h2>
<p>The first step in killing the SICS server manually is to find the
server process.
Log in as the instrument user on the instrument computer (for instance DMC on
lnsa05) and type the command:
<pre>
/home/DMC> ps -A
</pre>
Note the capital A given as parameter. The reward will be a listing like this:
<pre width =132>
PID TTY S TIME CMD
0 ?? R 01:56:28 [kernel idle]
1 ?? I 1:24.44 /sbin/init -a
3 ?? IW 0:00.20 /sbin/kloadsrv
24 ?? S 40:39.58 /sbin/update
97 ?? S 0:04.87 /usr/sbin/syslogd
99 ?? IW 0:00.03 /usr/sbin/binlogd
159 ?? S 1:43.70 /usr/sbin/routed -q
285 ?? S 1:00.45 /usr/sbin/portmap
293 ?? S 6:03.45 /usr/sbin/ypserv
299 ?? I 0:00.37 /usr/sbin/ypbind -s -S psunix,lnsa05.psi.ch
307 ?? I 0:00.52 /usr/sbin/mountd -i
309 ?? I 0:00.07 /usr/sbin/nfsd -t8 -u8
311 ?? I 0:00.09 /usr/sbin/nfsiod 7
317 ?? S 5:51.54 /usr/sbin/automount -f /etc/auto.master -M /psi
370 ?? I 0:28.58 -accepting connections (sendmail)
389 ?? S 1:41.15 /usr/sbin/xntpd -g -c /etc/ntp.conf
419 ?? S 6:00.16 /usr/sbin/snmpd
422 ?? S 1:00.91 /usr/sbin/os_mibs
438 ?? S 34:29.67 /usr/sbin/advfsd
449 ?? I 3:16.29 /usr/sbin/inetd
482 ?? IW 0:11.53 /usr/sbin/cron
510 ?? IW 0:00.02 /usr/lbin/lpd
525 ?? I 5:31.67 /usr/opt/psw/psw_agent -x/dev/null -f/usr/opt/psw/psw_agent.conf
532 ?? I 0:00.74 /usr/opt/psw/psw_sensor_syswd 1 -x/dev/null
555 ?? I 0:00.58 /usr/bin/nsrexecd
571 ?? I 0:20.27 /usr/dt/bin/dtlogin -daemon
583 ?? S 1:38.27 lpsbootd -F /etc/lpsodb -l 0 -x 1
585 ?? IW 0:00.04 /usr/sbin/getty /dev/lat/620 console vt100
586 ?? IW 0:00.03 /usr/sbin/getty /dev/lat/621 console vt100
587 ?? I 35:59.85 /usr/bin/X11/X :0 -auth /var/dt/authdir/authfiles/A:0-aaarBa
657 ?? I 0:01.46 rpc.ttdbserverd
4705 ?? IW 0:00.05 dtlogin -daemon
9127 ?? I 0:00.37 /usr/bin/X11/dxconsole -geometry 480x150-0-0 -daemon -nobuttons -verbose -notify -exitOnFail -nostdin -bg gray
9317 ?? IW 0:00.73 dtgreet -display :0
14412 ?? S 0:39.71 netscape
15524 ?? I 0:00.57 rpc.cmsd
21678 ?? S 0:00.11 telnetd
31912 ?? S 0:10.65 /home/DMC/bin/SICServer /home/DMC/bin/dmc.tcl
584 console IW + 0:00.21 /usr/sbin/getty console console vt100
21978 ttyp1 S 0:00.63 -tcsh (tcsh)
22269 ttyp1 R + 0:00.10 ps -A
</pre>
This is a listing of all running processes on the machine where this command
has been typed. Note the entry for the SICS server near the bottom, in the
line starting with <tt> 31912 ?? </tt>. In this example the server
is running. If the server were down, no such entry would be present.
</p>
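<p>
Picking the server line out of a long listing by eye is error prone. The
following sketch filters a <tt>ps -A</tt> style listing and prints only the
process number of the SICS server. The function name <tt>sics_pid</tt> is
invented for this example, and the sketch assumes the column layout shown in
the listing, with the process number in the first column.
</p>

```shell
# Read a `ps -A` style listing on stdin and print the PID of the SICServer.
# Lines mentioning keepalive are skipped: that is the restart loop, not the
# server itself (and it also keeps this awk process from matching itself).
sics_pid() {
    awk '!/keepalive/ {
        for (i = 4; i <= NF; i++)
            if ($i ~ /(^|\/)SICServer$/) { print $1; next }
    }'
}
```

<p>
Typical use is <tt>ps -A | sics_pid</tt>. Applied to the example listing, it
prints <tt>31912</tt>. No output means that no server process was found.
</p>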
<h2> Killing a hanging SICS server </h2>
<p>
Suppose the SICS server does not respond anymore. It
needs to be forced to exit. Please note that it is always better to
close the server via the <tt>Sics_Exitus</tt> command typed with manager
privilege in one of the command clients. In order to kill the server, you
must first find it using the scheme given above. The information
needed is the number given as the first item in the line where the server
is listed, in this case <tt>31912</tt>. Please note that this number will
be different each time. The command to force the server to stop is:
<pre>
/home/DMC> kill -9 31912
</pre>
Note that the second parameter is the number found with <tt>ps -A</tt>. The
SICServer will be restarted automatically by the system. Occasionally
you may find that you cannot connect to the SICS server after such an
operation. This is due to network buffering problems. Killing the server
again usually solves the problem.
</p>
<h2> Shutting The SICS Server Down Completely</h2>
<p>
This is done for you by the killsics shell script. Just type
<pre>
killsics
</pre>
at the unix command line. Here is what killsics does for you.
In order to shut down the SICS server completely, two processes must be killed:
the actual SICS server and the process which automatically restarts the
SICServer. The latter must be killed first. It can be found in the ps -A
listing as a line reading <b>keepalive SICServer </b>. Kill that one as
described above, then kill the SICServer. To restart SICS after this,
use the startsics command.
</p>
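<p>
The order matters: if the server is killed first, keepalive simply starts it
again. The sketch below turns a <tt>ps -A</tt> style listing into a list of
process numbers in the right order. It only illustrates the procedure and is
not the actual content of the killsics script; the function name
<tt>sics_kill_order</tt> is invented, and the sketch assumes the keepalive
line contains the text <tt>keepalive SICServer</tt> as described above.
</p>

```shell
# Read a `ps -A` style listing on stdin and print the PIDs to kill,
# one per line: keepalive first, the SICServer itself second.
sics_kill_order() {
    awk '
        # "[ ]" matches a single space; written this way so that this awk
        # process does not match its own command line in the listing.
        /keepalive[ ]SICServer/ { keep[++k] = $1; next }
        {
            for (i = 4; i <= NF; i++)
                if ($i ~ /(^|\/)SICServer$/) { srv[++s] = $1; break }
        }
        END {
            for (i = 1; i <= k; i++) print keep[i]
            for (i = 1; i <= s; i++) print srv[i]
        }'
}
```

<p>
The output can be inspected first with <tt>ps -A | sics_kill_order</tt> and
then passed, one number at a time, to <tt>kill -9</tt>.
</p>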
<h2>Restart Everything</h2>
<p>
If nothing seems to work any more, no connections can be obtained etc., then
the next step is to restart everything. This is especially necessary if
mechanics or electronics people were closer to the instrument than 400 meters.
<OL>
<LI> Reboot the histogram memory. It has a tiny button labelled RST. That's
the one. It can be operated with a hairpin, a ball point pen or the like.
<LI> Restart the SICServer. Watch for any messages about things not being
connected or configured.
<LI> Restart and reconnect the client programs.
</OL>
If this fails (even after a second try), there may be a network problem which
cannot be resolved by simple means.
</p>
<h2>Getting New SICS Software</h2>
<p>
Sometimes you might want to be sure that you have the latest SICS software.
This is how to get it:
<ol>
<li>Login to the instrument account.
<li>If you are not there, type cd to get into the home directory.
<li>Type <b>killsics</b> at the unix prompt in order to stop the SICS server.
<li>Type <b>sicsinstall exe</b> at the unix prompt for copying new
SICS software from the general distribution area.
<li>Type <b> startsics</b> to restart the SICS software.
</ol>
</p>
<h2>Hot Fixes</h2>
<p>
When there is trouble with SICS you may be asked by one of the SICS
programmers to copy the most recent development version of the SICS server
to your machine. This is done as follows:
<ol>
<li>Login to the instrument account.
<li>cd into the bin directory, for example: /home/DMC/bin.
<li>Type <b> killsics</b> at the unix prompt in order to stop the SICS server.
<li>Type <b>cp /data/koenneck/src/sics/SICServer .</b> at the unix prompt.
<li>Type <b> startsics</b> to restart the SICS software.
</ol>
<b>!!!!!! WARNING !!!!!!! Do this only when advised to do so by a competent
SICS programmer. Otherwise you might be copying a SICS server in an
unstable experimental state!</b>
</p>
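<p>
Since a hot fix may turn out to be worse than the version it replaces, it can
be worth keeping the old server binary around. The sketch below wraps the copy
step in a small function that does so. The backup step, the file name
<tt>SICServer.old</tt> and the function name <tt>install_hotfix</tt> are
additions made for this example; they are not part of the procedure above.
</p>

```shell
# Copy a new server binary $1 into bin directory $2, keeping the old one
# as SICServer.old so that the hot fix can be undone.
# The backup step and the name SICServer.old are assumptions of this sketch.
install_hotfix() {
    src=$1
    bindir=$2
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    if [ -f "$bindir/SICServer" ]; then
        cp -p "$bindir/SICServer" "$bindir/SICServer.old" || return 1
    fi
    cp "$src" "$bindir/SICServer"
}
```

<p>
It would be called between killsics and startsics, for example as
<tt>install_hotfix /data/koenneck/src/sics/SICServer /home/DMC/bin</tt>.
</p>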
<h2> HELP debugging!!!!</h2>
<p>
The SICS server should not hang or crash. To sort such
problems out it is very helpful if any available debugging information is
saved and presented to the programmers. The available information consists of
the log files written continuously by the SICS server and possibly core files
lying around. Core files have just this name: core. To save them, create a
new directory (for example dump2077) and copy the files into it. This looks
like:
<pre>
/home/DMC> mkdir dump2077
/home/DMC> cp log/*.log dump2077
/home/DMC> cp core dump2077
</pre>
The <tt>/home/DMC> </tt> is just the command prompt. Please note that core
files are only available after crashes of the server. These few commands
will help to analyse the cause of the problem and eventually to resolve it.
</p>
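<p>
The three commands can be bundled into a small function so that nothing is
forgotten in the heat of the moment. This is a sketch only: the function name
<tt>save_debug_info</tt> is invented, and the sketch assumes the layout used
in the example, with the log files in the <tt>log</tt> directory and the core
file directly in the home directory of the instrument account.
</p>

```shell
# Copy all log files and, if present, the core file from instrument home
# directory $1 into dump directory $2 (created if it does not exist).
save_debug_info() {
    home=$1
    dump=$2
    mkdir -p "$dump" || return 1
    for f in "$home"/log/*.log; do
        [ -e "$f" ] && cp "$f" "$dump"/
    done
    # A core file only exists after a crash of the server.
    if [ -f "$home/core" ]; then
        cp "$home/core" "$dump"/
    fi
    return 0
}
```

<p>
For the example above the call would be
<tt>save_debug_info /home/DMC /home/DMC/dump2077</tt>.
</p>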
</body>
</html>