PSI sics-cvs-psi-2006

This commit is contained in:
2006-05-08 02:00:00 +00:00
committed by Douglas Clowes
parent ae77364de2
commit 6e926b813f
388 changed files with 445529 additions and 14109 deletions

View File

@@ -6,6 +6,12 @@
<h1>SICS Trouble Shooting </h1>
<hr size=4 width="66%">
<H2>Check Server Status</h2>
<p>
One of the first things to do is to check the server status with:
monit status.
</p>
<hr size=4 width="66%">
<H2>Inspecting Log Files</h2>
<p>
Suppose something went wrong over the weekend or during the night and
@@ -29,6 +35,11 @@ stamps which allow to find out when exactly a problem began to
appear.
</p>
<p>
There is also another log file, log/monit.log, which logs messages from
the monit daemon. This can be used to determine when server processes
were restarted or when hardware failed.
</p>
<p>
Quite often the inspection of the log files will indicate problems
which are not software related such as:
<ul>
@@ -42,178 +53,65 @@ released before the motors move again.
<h2>Restarting SICS</h2>
<hr size=4 width="66%">
<p>
There is no such thing as bug free software. There are always bugs, nasty
behaviour etc. This document shall help to solve these problems. The usual
symptom will be that a client cannot connect to the server or the server is
not responding.
</p>
<p>
An essential prerequisite of SICS is that the servers are up
and running. The system is configured to restart the SICServer whenever it
fails. Only after a reboot or when the keepalive processes were killed (see
below) the SICServer must be restarted. This is done for all instruments by
typing:
<pre>
startsics
</pre>
at the command prompt. startsics actually starts several programs, see
the Setup section for details. All programs are started by means of a
shell script called
<b>keepalive</b>. keepalive is basically an endless loop which calls
the program again and agaian and thus ensures that the program will
never stop running.
</p>
<p>
When the SICS server hangs, or you want to enforce an reinitialization of
everything the server process must be killed. This can be accomplished either manually or through a shell script.
</p>
<h2>Stopping SICS</h2>
<p>
All SICS processes can be stopped through the command:
<pre>
killsics
</pre>
given at the unix command line. You must be the instrument user
(for example DMC) on the instrument computer for this to work properly.
</p>
<h2>Finding the SICS server</h2>
<p>The first thing when killing the SICS server manually is to find the
server process.
Log in as Instrument user on the instrument computer (for instance DMC on
lnsa05). Type the command:
<pre>
/home/DMC> ps -A
</pre>
Note the capital A given as parameter. The reward will be listing like this:
<pre width =132>
PID TTY S TIME CMD
0 ?? R 01:56:28 [kernel idle]
1 ?? I 1:24.44 /sbin/init -a
3 ?? IW 0:00.20 /sbin/kloadsrv
24 ?? S 40:39.58 /sbin/update
97 ?? S 0:04.87 /usr/sbin/syslogd
99 ?? IW 0:00.03 /usr/sbin/binlogd
159 ?? S 1:43.70 /usr/sbin/routed -q
285 ?? S 1:00.45 /usr/sbin/portmap
293 ?? S 6:03.45 /usr/sbin/ypserv
299 ?? I 0:00.37 /usr/sbin/ypbind -s -S psunix,lnsa05.psi.ch
307 ?? I 0:00.52 /usr/sbin/mountd -i
309 ?? I 0:00.07 /usr/sbin/nfsd -t8 -u8
311 ?? I 0:00.09 /usr/sbin/nfsiod 7
317 ?? S 5:51.54 /usr/sbin/automount -f /etc/auto.master -M /psi
370 ?? I 0:28.58 -accepting connections (sendmail)
389 ?? S 1:41.15 /usr/sbin/xntpd -g -c /etc/ntp.conf
419 ?? S 6:00.16 /usr/sbin/snmpd
422 ?? S 1:00.91 /usr/sbin/os_mibs
438 ?? S 34:29.67 /usr/sbin/advfsd
449 ?? I 3:16.29 /usr/sbin/inetd
482 ?? IW 0:11.53 /usr/sbin/cron
510 ?? IW 0:00.02 /usr/lbin/lpd
525 ?? I 5:31.67 /usr/opt/psw/psw_agent -x/dev/null -f/usr/opt/psw/psw_agent.conf
532 ?? I 0:00.74 /usr/opt/psw/psw_sensor_syswd 1 -x/dev/null
555 ?? I 0:00.58 /usr/bin/nsrexecd
571 ?? I 0:20.27 /usr/dt/bin/dtlogin -daemon
583 ?? S 1:38.27 lpsbootd -F /etc/lpsodb -l 0 -x 1
585 ?? IW 0:00.04 /usr/sbin/getty /dev/lat/620 console vt100
586 ?? IW 0:00.03 /usr/sbin/getty /dev/lat/621 console vt100
587 ?? I 35:59.85 /usr/bin/X11/X :0 -auth /var/dt/authdir/authfiles/A:0-aaarBa
657 ?? I 0:01.46 rpc.ttdbserverd
4705 ?? IW 0:00.05 dtlogin -daemon
9127 ?? I 0:00.37 /usr/bin/X11/dxconsole -geometry 480x150-0-0 -daemon -nobuttons -verbose -notify -exitOnFail -nostdin -bg gray
9317 ?? IW 0:00.73 dtgreet -display :0
14412 ?? S 0:39.71 netscape
15524 ?? I 0:00.57 rpc.cmsd
21678 ?? S 0:00.11 telnetd
31912 ?? S 0:10.65 /home/DMC/bin/SICServer /home/DMC/bin/dmc.tcl
584 console IW + 0:00.21 /usr/sbin/getty console console vt100
21978 ttyp1 S 0:00.63 -tcsh (tcsh)
22269 ttyp1 R + 0:00.10 ps -A
</pre>
This is a listing of all running processes on the machine where this command
has been typed. Note, in this case, at the bottom in the line starting with
<tt> 31912 ?? </tt> an entry for the SICS server. In this example the server
is running. If the server is down, no such entry would be present.
</p>
<h2> Killing a hanging SICS server </h2>
<p>
Suppose, the situation is that the SICS server does not respond anymore. It
needs to be forcefully exited. Please note, that it is always better to
close the server via the <tt>Sics_Exitus</tt> command typed with manager
privilege in one of the command clients. In order to kill the server it is
needed to find him first using the scheme given above. The information
needed is the number given as first item in the same line where the server
is listed. In this case: <tt>31912</tt>. Please note, that this number will
always be different. The command to force the server to stop is:
<pre>
/home/DMC> kill -9 31912
</pre>
Note, the second parameter is the number found with <tt>ps -A</tt>. The
SICServer will be restarted automatically by the system. Occasionally, it
may happen, that you cannot connect to the SICS server after such an
operation. This is due to some network buffering problems. Doing the killing
again usually solves the problem.
</p>
<h2> Shutting The SICS Server Down Completely</h2>
<p>
This is done for you by the killsics shell script. Just type
<pre>
killsics
</pre>
at the unix command line. Here is what killsics does for you:
In order to completely shutdown the SICS server two process must be killed:
the actual SICS server and the process which automatically restarts the
SICServer. The latter must be killed first. It can be found in the ps -A
listing as a line reading <b>keepalive SICServer </b>. Kill that one as
described above, then kill the SICServer. For restarting SICS after this,
use the startsics command.
<dl>
<dt>monit restart sicsserver
</dl>
</p>
<hr size=4 width="66%">
<h2>Restart Everything</h2>
<p>
If nothing seems to work any more, no connections can be obtained etc, then
the next guess is to restart everything. This is especially necessary if
mechanics or electronics people were closer to the instrument then 400 meters.
<OL>
mechanics or electronics people were closer to the instrument then a
nautical mile.
<uL>
<LI> Reboot the histogram memory. It has a tiny button labelled RST. That' s
the one. Can be operated with a hairpin, a ball point pen or the like.
<LI> Restart the SICServer. Watch for any messages about things not being
connected or configured.
<LI> Restart and reconnect the client programs.
</OL>
If this fails (even after a second) time there may be a network problem which
can not be resolved by simple means.
<li>Restart all of SICS with the sequence: monit stop all; monit quit; monit
<li>Wait for a couple of minutes for the system to come up.
</ul>
</p>
<h2>Getting New SICS Software</h2>
<hr size=4 width="66%">
<h2>Starting SICS Manually</h2>
<p>
Sometimes you might want to be sure that you have the latest SICS software.
This is how to get it:
<ol>
<li>Login to the instrument account.
<li>If you are no there type cd to get into the home directory.
<li>Type <b>killsics</b> at the unix prompt in order to stop the SICS server.
<li>Type <b>sicsinstall exe</b> at the unix prompt for copying new
SICS software from the general distribution area.
<li>Type <b> startsics</b> to restart the SICS software.
</ol>
In order to find out if some hardware is broken or if the SICS server
initializes badly it is useful to look at the SICS servers startup messages.
The following steps are required:
<ul>
<li>monit stop sicsserver
<li>cd ~/inst_sics
<li>./SICServer inst.tcl | more
</ul>
Replace inst by the name of the instrument, as usual. Look at the screen
output in
order to find out why SICS does not initialize things or where the
initialization hangs. Do not forget to kill the SICServer thus started when
you are done and to issue the command: <b>monit start sicsserver</b> in order
to place the SICS server back under monits control again.
</p>
<h2>Hot Fixes</h2>
<hr size=4 width="66%">
<h2>Test the SerPortServer Program</h2>
<p>
When there is trouble with SICS you may be asked by one of the SICS
programmers to copy the most recent development reason of the SICS server
to your machine. This is done as follows:
<ol>
<li>Login to the instrument account.
<li>cd into the bin directory, for example: /home/DMC/bin.
<li>Type <b> killsics</b> at the unix prompt in order to stop the SICS server.
<li>Type <b>cp /data/koenneck/src/sics/SICServer .</b> at the unix prompt.
<li>Type <b> startsics</b> to restart the SICS software.
</ol>
<b>!!!!!! WARNING !!!!!!!. Do this only when advised to do so by a competent
SICS programmer. Otherwise you might be copying a SICS server in an
instable experimental state!</b>
Sometimes the SerPortServer program hangs and inhibits the communication with
the RS-232 hardware. This can be diagnosed by the following procedure: Find
out at which port either a EL734 motor controller or a E737 counter box
lives. Then type:<b>asyncom localhost 4000 portnumber</b> This yields a
new prompt at which you type <b>ID</b>. If all is well a string identifying
the device will be printed. If not a large stack dump will come up.
The asyncom program can be exited by typing <b>quit</b>. If there is
a problem with the
SerPortServer program type: <b>monit restart SerPortServer</b> in order to
restart it.
</p>
<hr size=4 width="66%">
<h2>Trouble with Environment Devices</h2>
<p>
The first stop for trouble with temperature or other environment devices
is Markus Zolliker. A common problem is that old environment controllers
have not be deconfigured from the system and still reserve terminal server
ports. Thus take care to deconfigure your old devices when swapping.
</p>
<hr size=4 width="66%">
<h2> HELP debugging!!!!</h2>
<p>
The SICS server hanging or crashing should not happen. In order to sort such