Mathworks MATLAB DISTRIBUTED COMPUTING SERVER 4 System Administrator’s Guide

MATLAB® Distributed Computing Server™ 4
System Administrator’s Guide
How to Contact The MathWorks
www.mathworks.com
comp.soft-sys.matlab Newsgroup
www.mathworks.com/contact_TS.html Technical Support
suggest@mathworks.com Product enhancement suggestions
bugs@mathworks.com Bug reports
doc@mathworks.com Documentation error reports
service@mathworks.com Order status, license renewals, passcodes
info@mathworks.com Sales, pricing, and general information
508-647-7000 (Phone)
508-647-7001 (Fax)
The MathWorks, Inc. 3 Apple Hill Drive Natick, MA 01760-2098
For contact information about worldwide offices, see the MathWorks Web site.
© COPYRIGHT 2005–2010 by The MathWorks, Inc.
The software described in this document is furnished under a license agreement. The software may be used or copied only under the terms of the license agreement. No part of this manual may be photocopied or reproduced in any form without prior written consent from The MathWorks, Inc.
FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation by, for, or through the federal government of the United States. By accepting delivery of the Program or Documentation, the government hereby agrees that this software or documentation qualifies as commercial computer software or commercial computer software documentation as such terms are used or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014. Accordingly, the terms and conditions of this Agreement and only those rights specified in this Agreement, shall pertain to and govern the use, modification, reproduction, release, performance, display, and disclosure of the Program and Documentation by the federal government (or other entity acquiring for or through the federal government) and shall supersede any conflicting contractual terms or conditions. If this License fails to meet the government’s needs or is inconsistent in any respect with federal procurement law, the government agrees to return the Program and Documentation, unused, to The MathWorks, Inc.
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand
names may be trademarks or registered trademarks of their respective holders.
Patents
The MathWorks products are protected by one or more U.S. patents. Please see
www.mathworks.com/patents for more information.
MATLAB® Distributed Computing Server™ System Administrator’s Guide
Revision History
November 2005    Online only    New for Version 2.0 (Release 14SP3+)
December 2005    Online only    Revised for Version 2.0 (Release 14SP3+)
March 2006       Online only    Revised for Version 2.0.1 (Release 2006a)
September 2006   Online only    Revised for Version 3.0 (Release 2006b)
March 2007       Online only    Revised for Version 3.1 (Release 2007a)
September 2007   Online only    Revised for Version 3.2 (Release 2007b)
March 2008       Online only    Revised for Version 3.3 (Release 2008a)
October 2008     Online only    Revised for Version 4.0 (Release 2008b)
March 2009       Online only    Revised for Version 4.1 (Release 2009a)
September 2009   Online only    Revised for Version 4.2 (Release 2009b)
March 2010       Online only    Revised for Version 4.3 (Release 2010a)
Contents

1 Introduction
  Product Overview
    Overview
    Determining Product Installation and Versions
  Toolbox and Server Components
    Job Managers, Workers, and Clients
    Third-Party Schedulers
    Components on Mixed Platforms or Heterogeneous Clusters
    mdce Service
  Using Parallel Computing Toolbox Software

2 Network Administration
  Preparing for Parallel Computing
    Before You Start
    Planning Your Network Layout
    Network Requirements
    Fully Qualified Domain Names
    Security Considerations
  Installing and Configuring
  Using a Different MPI Build on UNIX Operating Systems
    Building MPI
    Using Your MPI Build
  Shutting Down a Job Manager Configuration
    UNIX and Macintosh Operating Systems
    Microsoft Windows Operating Systems
  Customizing Server Services
    Defining the Script Defaults
    Overriding the Script Defaults
  Accessing Service Record Files
    Locating Log Files
    Locating Checkpoint Directories
  Troubleshooting
    License Errors
    Memory Errors on UNIX Operating Systems
    Running Server Processes from a Windows Network Installation
    Required Ports
    Ephemeral TCP Ports with Job Manager
    Host Communications Problems
    Verifying Multicast Communications

3 Admin Center
  Starting Admin Center
  Setting Up Resources
    Adding Hosts
    Starting a Job Manager
    Starting Workers
    Stopping, Destroying, Resuming, Restarting Processes
    Moving a Worker
    Updating the Display
  Testing Connectivity
  Saving and Loading Sessions
  Preparing for User Configurations

4 Control Script Reference
  mdce Process Control
  Job Manager Control
  Worker Control

5 Control Scripts: Alphabetical List

Glossary
Index
1 Introduction
This chapter provides an introduction to the concepts and terms of Parallel Computing Toolbox™ software and MATLAB® Distributed Computing Server™ software.
“Product Overview” on page 1-2
“Toolbox and Server Components” on page 1-4
“Using Parallel Computing Toolbox Software” on page 1-8
Product Overview
In this section...
“Overview” on page 1-2
“Determining Product Installation and Versions” on page 1-3
Overview
Parallel Computing Toolbox and MATLAB Distributed Computing Server software let you solve computationally and data-intensive problems using MATLAB® and Simulink® on multicore and multiprocessor computers. Parallel processing constructs such as parallel for-loops and code blocks, distributed arrays, parallel numerical algorithms, and message-passing functions let you implement task-parallel and data-parallel algorithms at a high level in MATLAB without programming for specific hardware and network architectures.
A job is some large operation that you need to perform in your MATLAB session. A job is broken down into segments called tasks. You decide how best to divide your job into tasks. You could divide your job into identical tasks, but tasks do not have to be identical.
The MATLAB session in which the job and its tasks are defined is called the client session. Often, this is on the machine where you program MATLAB. The client uses Parallel Computing Toolbox software to perform the definition of jobs and tasks. The MATLAB Distributed Computing Server product performs the execution of your job by evaluating each of its tasks and returning the result to your client session.
The job manager is the part of the server software that coordinates the execution of jobs and the evaluation of their tasks. The job manager distributes the tasks for evaluation to the server’s individual MATLAB sessions called workers. Use of the MathWorks™ job manager is optional; the distribution of tasks to workers can also be performed by a third-party scheduler, such as Windows® HPC Server (including CCS), a Platform LSF scheduler, or a PBS Pro® scheduler.
See the “Glossary” on page Glossary-1 for definitions of the parallel computing terms used in this manual.
[Figure: Basic Parallel Computing Configuration. A MATLAB client (Parallel Computing Toolbox) connects to a scheduler or job manager, which connects to several MATLAB workers (MATLAB Distributed Computing Server).]
Determining Product Installation and Versions
To determine if Parallel Computing Toolbox software is installed on your system, type this command at the MATLAB prompt:
ver
When you enter this command, MATLAB displays information about the version of MATLAB you are running, including a list of all toolboxes installed on your system and their version numbers.
You can run the ver command as part of a task in a distributed or parallel application to determine what version of MATLAB Distributed Computing Server software is installed on a worker machine. Note that the toolbox and server software must be the same version.
Toolbox and Server Components
In this section...
“Job Managers, Workers, and Clients” on page 1-4
“Third-Party Schedulers” on page 1-6
“Components on Mixed Platforms or Heterogeneous Clusters” on page 1-7
“mdce Service” on page 1-7
Job Managers, Workers, and Clients
The optional job manager can run on any machine on the network. The job manager runs jobs in the order in which they are submitted, unless any jobs in its queue are promoted, demoted, canceled, or destroyed.
Each worker receives a task of the running job from the job manager, executes the task, returns the result to the job manager, and then receives another task. When all tasks for a running job have been assigned to workers, the job manager starts running the next job with the next available worker.
A MATLAB Distributed Computing Server network configuration usually includes many workers that can all execute tasks simultaneously, speeding up execution of large MATLAB jobs. It is generally not important which worker executes a specific task. Each worker evaluates tasks one at a time, returning the results to the job manager. The job manager then returns the results of all the tasks in the job to the client session.
Note For testing your application locally or other purposes, you can configure a single computer as client, worker, and job manager. You can also have more than one worker session or more than one job manager session on a machine.
[Figure: Interactions of Parallel Computing Sessions. Clients send jobs to the scheduler or job manager, which sends tasks to the workers; the workers return results, and all results are returned to the clients.]
A large network might include several job managers as well as several client sessions. Any client session can create, run, and access jobs on any job manager, but a worker session is registered with and dedicated to only one job manager at a time. The following figure shows a configuration with multiple job managers.
[Figure: Configuration with Multiple Clients and Job Managers. Several clients connect to several schedulers or job managers, each of which has its own set of workers.]
Third-Party Schedulers
As an alternative to using the MathWorks job manager, you can use a third-party scheduler. This could be a Microsoft® Windows HPC Server (including CCS), Platform LSF scheduler, PBS Pro scheduler, TORQUE scheduler, mpiexec, or a generic scheduler.
Choosing Between a Scheduler and Job Manager
You should consider the following when deciding to use a scheduler or the MathWorks job manager for distributing your tasks:
Does your cluster already have a scheduler?
If you already have a scheduler, you may be required to use it as a means of controlling access to the cluster. Your existing scheduler might be just as easy to use as a job manager, so there might be no need for the extra administration involved.
Is the handling of parallel computing jobs the only cluster scheduling management you need?
The MathWorks job manager is designed specifically for MathWorks parallel computing applications. If other scheduling tasks are not needed, a third-party scheduler might not offer any advantages.
Is there a file sharing configuration on your cluster already?
The MathWorks job manager can handle all file and data sharing necessary for your parallel computing applications. This might be helpful in configurations where shared access is limited.
Are you interested in batch or interactive processing?
When you use a job manager, worker processes usually remain running at all times, dedicated to their job manager. With a third-party scheduler, workers are run as applications that are started for the evaluation of tasks, and stopped when their tasks are complete. If tasks are small or take little time, starting a worker for each one might involve too much overhead time.
Are there security concerns?
Your scheduler may be configured to accommodate your particular security requirements.
How many nodes are on your cluster?
If you have a large cluster, you probably already have a scheduler. Consult your MathWorks representative if you have questions about cluster size and the job manager.
Who administers your cluster?
The person administering your cluster might have a preference for how jobs are scheduled.
Components on Mixed Platforms or Heterogeneous Clusters
Parallel Computing Toolbox software and MATLAB Distributed Computing Server software are supported on Windows®, UNIX® (including Linux®), and Macintosh® operating systems. Mixed platforms are supported, so that the clients, job managers, and workers do not have to be on the same platform. The cluster can also be comprised of both 32-bit and 64-bit machines, so long as your data does not exceed the limitations posed by the 32-bit systems.
For a complete listing of all network requirements, including those for heterogeneous environments, see the System Requirements page for MATLAB Distributed Computing Server software at
http://www.mathworks.com/products/distriben/requirements.html
In a mixed platform environment, be sure to follow the proper installation instructions for each local machine on which you are installing the software.
mdce Service
If you are using the MathWorks job manager, every machine that hosts a worker or job manager session must also run the mdce service.
The mdce service recovers worker and job manager sessions when their host machines crash. If a worker or job manager machine crashes, when mdce starts up again (usually configured to start at machine boot time), it automatically restarts the job manager and worker sessions to resume their sessions from before the system crash.
Using Parallel Computing Toolbox Software
A typical Parallel Computing Toolbox client session includes the following steps:
1 Find a Job Manager (or scheduler): Your network may have one or more job managers available (but usually only one scheduler). The function you use to find a job manager or scheduler creates an object in your current MATLAB session to represent the job manager or scheduler that will run your job.
2 Create a Job: You create a job to hold a collection of tasks. The job exists on the job manager (or scheduler’s data location), but a job object in the local MATLAB session represents that job.
3 Create Tasks: You create tasks to add to the job. Each task of a job can be represented by a task object in your local MATLAB session.
4 Submit a Job to the Job Queue for Execution: When your job has all its tasks defined, you submit it to the queue in the job manager or scheduler. The job manager or scheduler distributes your job’s tasks to the worker sessions for evaluation. When all of the workers are completed with the job’s tasks, the job moves to the finished state.
5 Retrieve the Job’s Results: The resulting data from the evaluation of the job is available as a property value of each task object.
6 Destroy the Job: When the job is complete and all its results are gathered, you can destroy the job to free memory resources.
2 Network Administration
This chapter provides information useful for network administration of Parallel Computing Toolbox software and MATLAB Distributed Computing Server software.
“Preparing for Parallel Computing” on page 2-2
“Installing and Configuring” on page 2-5
“Using a Different MPI Build on UNIX Operating Systems” on page 2-6
“Shutting Down a Job Manager Configuration” on page 2-9
“Customizing Server Services” on page 2-13
“Accessing Service Record Files” on page 2-17
“Troubleshooting” on page 2-19
Preparing for Parallel Computing
In this section...
“Before You Start” on page 2-2
“Planning Your Network Layout” on page 2-2
“Network Requirements” on page 2-3
“Fully Qualified Domain Names” on page 2-3
“Security Considerations” on page 2-4
This section discusses the requirements and configurations for your network to support parallel computing.
Before You Start
Before attempting to install Parallel Computing Toolbox software and MATLAB Distributed Computing Server software, read Chapter 1, “Introduction” to familiarize yourself with the concepts and vocabulary of the products.
Planning Your Network Layout
Generally, it is easy to decide which machines will run worker processes and which will run client processes. Worker sessions usually run on the cluster of machines dedicated to that purpose. The MATLAB client session usually runs where MATLAB programs are run, often on a user’s desktop.
The job manager process should run on a stable machine, with adequate resources to manage the number of tasks and amount of data expected in your parallel computing applications.
The following table shows what products and processes are needed for each of these roles in the parallel computing configuration.

Session       Product                               Processes
Client        Parallel Computing Toolbox            MATLAB with toolbox
Worker        MATLAB Distributed Computing Server   worker; mdce service (if using a job manager)
Job manager   MATLAB Distributed Computing Server   mdce service; job manager

The server software includes the mdce service or daemon. The mdce service is separate from the worker and job manager processes, and it must be running on all machines that run job manager sessions or workers that are registered with a job manager. (The mdce service is not used with third-party schedulers.)
You can install both toolbox and server software on the same machine, so that one machine can run both client and server sessions.
Network Requirements
To view the network requirements for MATLAB Distributed Computing Server software, visit the product requirements page on the MathWorks Web site at
http://www.mathworks.com/products/distriben/requirements.html
Fully Qualified Domain Names
MATLAB Distributed Computing Server software and Parallel Computing Toolbox software support both short hostnames and fully qualified domain names. The default usage is short hostnames. If your network requires fully qualified hostnames, you can use the mdce_def file to identify the worker nodes by their full names. See “Customizing Server Services” on page 2-13. To set the hostname used for a MATLAB client session, see the pctconfig reference page.
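Before deciding whether worker nodes need to be identified by their full names, it can help to see which form a host reports for itself. The following is a minimal shell sketch, not part of the product scripts. The helper is_fqdn is hypothetical, and it rests on a simplifying assumption: any name with at least one dot between non-empty labels is treated as fully qualified.

```shell
#!/bin/sh
# Hypothetical helper (not part of the MathWorks scripts): report whether a
# host name looks fully qualified. Assumption: a name with at least one dot
# between non-empty labels counts as fully qualified.
is_fqdn() {
    case "$1" in
        .*|*.|*..*) echo "no" ;;   # empty label somewhere: reject
        *.*)        echo "yes" ;;  # at least one dot: treat as fully qualified
        *)          echo "no" ;;   # short host name (or empty string)
    esac
}

# Show what this machine reports as its own name.
name=$(hostname 2>/dev/null || echo unknown)
echo "hostname reports: $name (fully qualified: $(is_fqdn "$name"))"
```

If the reported name is short and your network requires full names, the mdce_def file is the place to set them, as described above.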
Security Considerations
The parallel computing products do not provide any security measures. Therefore, be aware of the following security considerations:
MATLAB workers run as whatever user the administrator starts the node’s mdce service under. By default, the mdce service starts as root on UNIX operating systems, and as LocalSystem on Microsoft Windows operating systems. Because MATLAB provides system calls, users can submit jobs that execute shell commands.
The mdce service does not enforce any access control or authentication. Anyone with local or remote access to the mdce services can start and stop their workers and job managers, and query for their status.
The job manager does not restrict access to the cluster, nor to job and task data. Using a third-party scheduler instead of the MathWorks job manager could allow you to take advantage of the security measures it provides.
The parallel computing processes must all be on the same side of a firewall, or you must take measures to enable them to communicate with each other through the firewall. Workers running tasks of the same parallel job cannot be firewalled off from each other, because their MPI-based communication will not work.
If certain ports are restricted, you can specify the ports used for parallel computing. See “Defining the Script Defaults” on page 2-13.
If your network supports multicast, the parallel computing processes accommodate multicast. However, because multicast is disabled on many networks for security reasons, you might require unicast communication between parallel computing processes. Most examples of parallel computing scripts and functions in this documentation show unicast usage.
If your organization is a member of the Internet Multicast Backbone (MBone), make sure that your parallel computing cluster is isolated from MBone access if you are using multicast for parallel computing. This is generally the default condition. If you have any questions about MBone membership, contact your network administrator.
Installing and Configuring
To find the most up-to-date instructions for installing and configuring the current or past versions of the parallel computing products, visit the MathWorks Web site at
http://www.mathworks.com/support/product/DM/installation/ver_current/
Using a Different MPI Build on UNIX Operating Systems
In this section...
“Building MPI” on page 2-6
“Using Your MPI Build” on page 2-6
Building MPI
To use an MPI build that differs from the one provided with Parallel Computing Toolbox, you first need to create that build. This section outlines the steps. If you already have an alternative MPI build, proceed to “Using Your MPI Build” on page 2-6.
1 Unpack the MPI sources into the target file system on your machine. For example, suppose you have downloaded mpich2-distro.tgz and want to unpack it into /opt for building:
# cd /opt
# mkdir mpich2 && cd mpich2
# tar zxvf path/to/mpich2-distro.tgz
# cd mpich2-1.0.8
2 Build your MPI using the enable-sharedlibs option (this is vital, as you must build a shared library MPI, binary compatible with MPICH2-1.0.8 for R2009b and later). For example, the following commands build an MPI with the nemesis channel device and the gforker launcher.
# ./configure -prefix=/opt/mpich2/mpich2-1.0.8 \
    --enable-sharedlibs=gcc \
    --with-device=ch3:nemesis \
    --with-pm=gforker 2>&1 | tee log
# make 2>&1 | tee -a log
# make install 2>&1 | tee -a log
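After an install like the one above, the two files that matter for using the build are bin/mpiexec and lib/libmpich.so under the install prefix. The following shell sketch checks for both. The function check_mpi_build is a hypothetical helper, and the default prefix is simply the example path used in this section.

```shell
#!/bin/sh
# Hypothetical helper: confirm that an MPI install prefix contains the
# launcher and the shared library that a shared-library build should produce.
check_mpi_build() {
    prefix=$1
    rc=0
    if [ ! -x "$prefix/bin/mpiexec" ]; then
        echo "missing or not executable: $prefix/bin/mpiexec"
        rc=1
    fi
    if [ ! -f "$prefix/lib/libmpich.so" ]; then
        echo "missing: $prefix/lib/libmpich.so"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "MPI build at $prefix looks complete"
    fi
    return "$rc"
}

# Example prefix from this section; pass a different one as the first argument.
check_mpi_build "${1:-/opt/mpich2/mpich2-1.0.8}" || true
```

A passing check only shows that the files exist. Running mpiexec itself is still the real test of the build.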
Using Your MPI Build
When your MPI build is ready, use the following steps to get the Parallel Computing Toolbox mpiexec scheduler working with it. Most of these steps are also needed if you want to use a different MPI build with third-party schedulers (LSF, generic).
1 Test your build by running the mpiexec executable. The build should be ready to test if its bin/mpiexec and lib/libmpich.so are available in the MPI installation location.
Following the example in “Building MPI” on page 2-6, /opt/mpich2/mpich2-1.0.8/bin/mpiexec and /opt/mpich2/mpich2-1.0.8/lib/libmpich.so are ready to use, so you can test the build with:
$ /opt/mpich2/mpich2-1.0.8/bin/mpiexec -n 4 hostname
2 Create an mpiLibConf function to direct Parallel Computing Toolbox to use your new MPI. Write your mpiLibConf.m to return the appropriate information for your build. For example:
function [primary, extras] = mpiLibConf
primary = '/opt/mpich2/mpich2-1.0.8/lib/libmpich.so';
extras = {};
The primary path must be valid on the cluster, and your mpiLibConf.m file must be higher on the cluster workers’ path than matlabroot/toolbox/distcomp/mpi. (Sending mpiLibConf.m as a file dependency for this purpose does not work. You can get the mpiLibConf.m function on the worker path by either moving the file into a directory on the path, or by having the scheduler use cd in its command so that it starts the MATLAB worker from within the directory that contains the function.)
3 Determine necessary daemons and command-line options.
- Determine all necessary daemons (often something like mpdboot or smpd). The gforker build example in this section uses an MPI that needs no services or daemons running on the cluster, but it can use only the local machine.
- Determine the correct command-line options to pass to mpiexec.
4 Use one of the following options to set up your scheduler to use your new MPI build:
- For the simplest case of the mpiexec scheduler, set up a configuration to use the mpiexec executable from your new MPI build. It is crucial that you use matching mpiexec, MPI library, and any daemons (if any), together. Set the configuration’s MpiexecFileName property to /opt/mpich2/mpich2-1.0.8/bin/mpiexec.
- If you are using a third-party scheduler (either fully supported or via the generic interface), modify your parallel wrapper script to pick up the correct mpiexec. Additionally, there may be a stage in the wrapper script where the MPI daemons are launched.
The parallel submission wrapper script must:
  - Determine which nodes are allocated by the scheduler.
  - Start required daemon processes. For example, for the MPD process manager this means calling "mpdboot -f <nodefile>".
  - Define which mpiexec executable to use for starting workers.
  - Stop the daemon processes. For example, for the MPD process manager this means calling "mpdallexit".
For examples of parallel wrapper scripts, see matlabroot/toolbox/distcomp/examples/integration/; specifically for an example of Sun Grid Engine, look in the folder sge for sgeParallelWrapper.sh. Adopt and modify the appropriate script for your particular cluster usage.
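The four duties of a parallel submission wrapper script listed above can be sketched in shell as follows. This is an illustrative skeleton for an MPD-based build, not one of the shipped example scripts. The names MPI_PREFIX, DRY_RUN, wrapper_main, and run are hypothetical, the node file format (one host per line) is an assumption, and starting one worker per line of the node file is a simplification. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a parallel wrapper script for an MPD-based MPI build.
# Assumptions (hypothetical names, not defined by the product): MPI_PREFIX,
# DRY_RUN, wrapper_main, and run exist only for this illustration.
MPI_PREFIX=${MPI_PREFIX:-/opt/mpich2/mpich2-1.0.8}
DRY_RUN=${DRY_RUN:-1}          # 1 = only print the commands

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" -eq 0 ]; then "$@"; fi
}

wrapper_main() {
    nodefile=$1       # file listing the allocated hosts, one per line
    worker_cmd=$2     # command that starts one worker

    # 1. Determine which nodes the scheduler allocated.
    nnodes=$(grep -c . "$nodefile")

    # 2. Start the required daemon processes.
    run "$MPI_PREFIX/bin/mpdboot" -f "$nodefile"

    # 3. Start the workers with the mpiexec that matches the MPI library.
    run "$MPI_PREFIX/bin/mpiexec" -n "$nnodes" "$worker_cmd"
    rc=$?

    # 4. Stop the daemon processes, whatever the workers returned.
    run "$MPI_PREFIX/bin/mpdallexit"
    return "$rc"
}

# Demonstration with a throwaway node file; a real scheduler supplies its own.
demo=$(mktemp)
printf 'node1\nnode2\n' > "$demo"
wrapper_main "$demo" "hostname"
rm -f "$demo"
```

Keeping the mpiexec path and the daemon commands under the same prefix is one way to make sure the launcher, library, and daemons all come from the same build.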
Shutting Down a Job Manager Configuration
In this section...
“UNIX and Macintosh Operating Systems” on page 2-9
“Microsoft Windows Operating Systems” on page 2-11
If you are done using the job manager and its workers, you might want to shut down the server software processes so that they are not consuming network resources. You do not need to be at the computer running the processes that you are shutting down. You can run these commands from any machine with network access to the processes. The following sections explain shutting down the processes for different platforms.
UNIX and Macintosh Operating Systems
Enter the commands of this section at the prompt in a UNIX shell.
Stopping the Job Manager and Workers
1 To shut down the job manager, enter the commands
cd matlabroot/toolbox/distcomp/bin
(Enter the following command on a single line.)
stopjobmanager -remotehost <job manager hostname> -name <MyJobManager> -v
If you have more than one job manager running, stop each of them individually by host and name.
For a list of all options to the script, type
stopjobmanager -help
2 For each MATLAB worker you want to shut down, enter the commands
cd matlabroot/toolbox/distcomp/bin
stopworker -remotehost <worker hostname> -v
If you have more than one worker session running, you can stop each of them individually by host and name.
stopworker -name worker1 -remotehost <worker hostname>
stopworker -name worker2 -remotehost <worker hostname>
For a list of all options to the script, type
stopworker -help
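When several workers on several hosts have to be stopped, the per-worker commands above can be generated in a loop. The following shell sketch uses only the stopworker flags shown in this section. MATLABROOT, DRY_RUN, stop_workers, and the host and worker names are placeholders chosen for illustration, and by default the sketch only prints the commands.

```shell
#!/bin/sh
# Sketch: stop one named worker per host:name pair.
# Assumptions: MATLABROOT, DRY_RUN, and the host and worker names are
# placeholders; only the stopworker flags shown in this section are used.
MATLABROOT=${MATLABROOT:-/usr/local/MATLAB}   # hypothetical install location
DRY_RUN=${DRY_RUN:-1}                         # 1 = only print the commands

stop_workers() {
    # Arguments: pairs given as host:workername
    for pair in "$@"; do
        host=${pair%%:*}    # text before the first colon
        name=${pair#*:}     # text after the first colon
        cmd="$MATLABROOT/toolbox/distcomp/bin/stopworker"
        echo "+ $cmd -remotehost $host -name $name -v"
        if [ "$DRY_RUN" -eq 0 ]; then
            "$cmd" -remotehost "$host" -name "$name" -v
        fi
    done
}

stop_workers node1:worker1 node1:worker2 node2:worker1
```

Setting DRY_RUN=0 runs the commands instead of printing them, so it is worth reviewing the printed list first.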
Stopping and Uninstalling the mdce Daemon
Normally, you configure the mdce daemon to start at system boot time and continue running until the machine shuts down. However, if you plan to uninstall the MATLAB Distributed Computing Server product from a machine, you might want to uninstall the mdce daemon also, because you no longer need it.
Note You must have root privileges to stop or uninstall the mdce daemon.
1 Use the following command to stop the mdce daemon:
/etc/init.d/mdce stop
2 Remove the installed link to prevent the daemon from starting up again
at system reboot:
cd /etc/init.d/
rm mdce
Stopping the Daemon Manually. If you used the alternative manual startup of the mdce daemon, use the following commands to stop it manually:
cd matlabroot/toolbox/distcomp/bin
mdce stop
Microsoft Windows Operating Systems
Stopping the Job Manager and Workers
Enter the commands of this section at the prompt in a DOS command window.
1 To shut down the job manager, enter the commands
cd matlabroot\toolbox\distcomp\bin
(Enter the following command on a single line.)
stopjobmanager -remotehost <job manager hostname> -name <MyJobManager> -v
If you have more than one job manager running, stop each of them individually by host and name.
For a list of all options to the script, type
stopjobmanager -help
2 For each MATLAB worker you want to shut down, enter the commands
cd matlabroot\toolbox\distcomp\bin
stopworker -remotehost <worker hostname> -name <worker name> -v
If you have more than one worker session running, you can stop each of them individually by host and name.
stopworker -remotehost <worker hostname> -name <worker1 name>
stopworker -remotehost <worker hostname> -name <worker2 name>
For a list of all options to the script, type
stopworker -help
Stopping and Uninstalling the mdce Service
Normally, you configure the mdce service to start at system boot time and continue running until the machine shuts down. If you need to stop the mdce service while leaving the machine on, enter the following commands at a DOS command prompt:
cd matlabroot\toolbox\distcomp\bin
mdce stop
If you plan to uninstall the MATLAB Distributed Computing Server product from a machine, you might want to uninstall the mdce service also, because you no longer need it.
You do not need to stop the service before uninstalling it.
To uninstall the mdce service, enter the following commands at a DOS command prompt:
cd matlabroot\toolbox\distcomp\bin
mdce uninstall
Customizing Server Services
In this section...
“Defining the Script Defaults” on page 2-13
“Overriding the Script Defaults” on page 2-15
The MATLAB Distributed Computing Server scripts run using several default parameters. You can customize the scripts, as described in this section.
Defining the Script Defaults
The scripts for the server services require values for several paramete rs. These parameters set the process name, the user name, log file location, ports, etc. Some of these can be set using flags on the command lines, but the full set o f user-configurable p arameters are in the
Note The startup script flags take precedence over the settings in the
mdce_def file.
Customizing Server Services
mdce_def file.
The default parameters used by the server service scripts are defined in the file:
matlabroot\toolbox\distcomp\bin\mdce_def.bat (on Microsoft Windows operating systems)
matlabroot/toolbox/distcomp/bin/mdce_def.sh (on UNIX or Macintosh operating systems)
To set the default parameters, edit this file before installing or starting the mdce service.
The mdce_def file is self-documented, and includes explanations of all its parameters.
Note If you want to run more than one job manager on the same machine, they must all have unique names. Specify the names using flags with the startup commands.
Setting the User
By default, the job manager and worker services run as the user who starts them. You can run the services as a different user with the following settings in the mdce_def file.
Parameter Description
MDCEUSER
Set this parameter to run the mdce services as a user different from the user who starts the service. On a UNIX operating system, set the value before starting the service; on a Windows operating system, set it before installing the service.
MDCEPASS
On a Windows operating system, set this parameter to specify the password for the user identified in the MDCEUSER parameter; otherwise, the system prompts you for the password when the service is installed.
On UNIX operating systems, MDCEUSER requires that the current machine has the sudo utility installed, and that the current user be allowed to use sudo to execute commands as the user identified by MDCEUSER. For further information, refer to your system documentation on the sudo and sudoers utilities (for example, man sudo and man sudoers).
On Windows operating systems, when executing the mdce start script, the user defined by MDCEUSER must be listed among those who can log on as a service. To see the list of valid users, select the Windows menu Start > Settings > Control Panel. Double-click Administrative Tools, then Local Security Policy. In the tree, select User Rights Assignment, then in the right pane, double-click Log on as a service. This dialog box must list the user defined for MDCEUSER in your mdce_def.bat file. If not, you can add the user to this dialog box according to the instructions in the mdce_def.bat file, or when running mdce start, you can use another mdce_def.bat file that specifies a listed user.
Overriding the Script Defaults
Specifying an Alternative Defaults File
The default parameters used by the mdce service, job managers, and workers are defined in the file:
matlabroot\toolbox\distcomp\bin\mdce_def.bat (on Windows operating systems)
matlabroot/toolbox/distcomp/bin/mdce_def.sh (on UNIX or Macintosh operating systems)
Before installing and starting the mdce service, you can edit this file to set the default parameters with values you require.
Alternatively, you can make a copy of this file, modify the copy, and specify that this copy be used for the default parameters.
On UNIX or Macintosh operating systems, enter the command
mdce start -mdcedef my_mdce_def.sh
On Windows operating systems, enter the commands
mdce install -mdcedef my_mdce_def.bat
mdce start -mdcedef my_mdce_def.bat
If you specify a new mdce_def file instead of the default file for the service on one computer, the new file is not automatically used by the mdce service on other computers. If you want to use the same alternative file for all your mdce services, you must specify it for each mdce service you install or start.
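The UNIX sequence described above (copy the shipped defaults file, edit the copy, start the service with the copy) might be scripted along the following lines. This is only a sketch: the helper name mdce_start_command, the MATLABROOT value, and the SITE_DEF path are hypothetical placeholders, and the script prints the commands rather than running them so that they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: MATLABROOT, SITE_DEF, and the function name are
# hypothetical placeholders; adjust them for your installation.

# Print the start command that points the mdce service at an alternative
# defaults file. Nothing is executed here.
mdce_start_command() {
    matlabroot=$1      # MATLAB installation folder
    defaults_file=$2   # modified copy of mdce_def.sh
    printf '%s/toolbox/distcomp/bin/mdce start -mdcedef %s\n' \
        "$matlabroot" "$defaults_file"
}

MATLABROOT=/opt/matlab
SITE_DEF=/etc/mdce/my_mdce_def.sh

# Step 1: copy the shipped defaults file once, then edit the copy.
echo "cp $MATLABROOT/toolbox/distcomp/bin/mdce_def.sh $SITE_DEF"
# Step 2: start the service with the copy. Repeat on every host.
mdce_start_command "$MATLABROOT" "$SITE_DEF"
```

Because the alternative file is not picked up automatically on other computers, the printed start command has to be issued on each host that should use it.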
For more information, see “Defining the Script Defaults” on page 2-13.
Note The startup script flags take precedence over the settings in the mdce_def file.
Starting in a Clean State
When a job manager or worker starts up, it normally resumes its session from the past. This way, a job queue is not destroyed or lost if the job manager machine crashes or if the job manager is inadvertently shut down. To start up a job manager or worker from a clean state, with all history deleted, use the -clean flag on the start command:
startjobmanager -clean -name MyJobManager
startworker -clean -jobmanager MyJobManager
Accessing Service Record Files
In this section...
“Locating Log Files” on page 2-17
“Locating Checkpoint Directories” on page 2-18
The MATLAB Distributed Computing Server services generate various record files in the normal course of their operations. The mdce service, job manager, and worker sessions all generate such files. This section describes the types of information stored by the services.
Locating Log Files
Log files for each service contain entries for the service’s operations. These might be of particular interest to the network administrator in cases when problems arise.
Operating System File Location
Windows
The default location of the log files is <TEMP>\MDCE\Log, where <TEMP> is the value of the system TEMP variable. For example, if TEMP is set to C:\TEMP, the log files are placed in C:\TEMP\MDCE\Log.
You can set alternative locations for the log files by modifying the LOGBASE setting in the mdce_def.bat file before starting the mdce service.
UNIX and Macintosh
The default location of the log files is /var/log/mdce/.
You can set alternative locations for the log files by modifying the LOGBASE setting in the mdce_def.sh file before starting the mdce service.
Locating Checkpoint Directories
Checkpoint directories contain information related to persistence data, which the server services use to create continuity from one instance of a session to another. For example, if you stop and restart a job manager, the new session continues the old session, using all the same data.
A primary feature offered by the checkpoint directories is in crash recovery. This allows server services to automatically resume their sessions after a system goes down and comes back up, minimizing the loss of data. However, if a MATLAB worker goes down during the evaluation of a task, that task is neither reevaluated nor reassigned to another worker. In this case, a finished job may not have a complete set of output data, because data from any unfinished tasks might be missing.
Note If a job manager crashes and restarts, its workers can take up to 2 minutes to reregister with it.
Platform File Location
Windows
The default location of the checkpoint directories is <TEMP>\MDCE\Checkpoint, where <TEMP> is the value of the system TEMP variable. For example, if TEMP is set to C:\TEMP, the checkpoint directories are placed in C:\TEMP\MDCE\Checkpoint.
You can set alternative locations for the checkpoint directories by modifying the CHECKPOINTBASE setting in the mdce_def.bat file before starting the mdce service.
UNIX and Macintosh
The checkpoint directories are placed by default in /var/lib/mdce/.
You can set alternative locations for the checkpoint directories by modifying the CHECKPOINTBASE setting in the mdce_def.sh file before starting the mdce service.
Troubleshooting
In this section...
“License Errors” on page 2-19
“Memory Errors on UNIX Operating Systems” on page 2-21
“Running Server Processes from a Windows Network Installation” on page 2-21
“Required Ports” on page 2-21
“Ephemeral TCP Ports with Job Manager” on page 2-23
“Host Communications Problems” on page 2-23
“Verifying Multicast Communications” on page 2-25
This section offers advice on solving problems you might encounter with MATLAB Distributed Computing Server software.
License Errors
When starting a MATLAB worker, a licensing problem might result in the message
License checkout failed.
No such FEATURE exists.
License Manager Error -5
There are many reasons why you might receive this error:
This message usually indicates that you are trying to use a product for which you are not licensed. Look at your license.dat file located within your MATLAB installation to see if you are licensed to use this product.
If you are licensed for this product, this error may be the result of having extra carriage returns or tabs in your license file. To avoid this, ensure that each line begins with either #, SERVER, DAEMON, or INCREMENT. After fixing your license.dat file, restart your license manager and MATLAB should work properly.
This error may also be the result of an incorrect system date. If your system date is before the date that your license was made, you will get this error.
If you receive this error when starting a worker with MATLAB Distributed Computing Server software:
- You may be calling the startworker command from an installation that does not have access to a worker license. For example, starting a worker from a client installation of the Parallel Computing Toolbox product causes the following error:
The mdce service on the host hostname returned the following error:
Problem starting the MATLAB worker.
The cause of this problem is:
==============================================================
Most likely, the MATLAB worker failed to start due to a licensing problem, or MATLAB crashed during startup. Check the worker log file /tmp/mdce_user/node_node_worker_05-11-01_16-52-03_953.log for more detailed information. The mdce log file /tmp/mdce_user/mdce-service.log may also contain some additional information.
==============================================================
In the worker log files, you see the following information:
License checkout failed.
License Manager Error -15
MATLAB is unable to connect to the license server. Check that the license manager has been started, and that the MATLAB client machine can communicate with the license server.
Troubleshoot this issue by visiting: http://www.mathworks.com/support/lme/R2009a/15
Diagnostic Information:
Feature: MATLAB_Distrib_Comp_Engine
License path: /apps/matlab/etc/license.dat
FLEXnet Licensing error: -15,570. System Error: 115
- If you installed only the Parallel Computing Toolbox product, and you
are attempting to run a worker on the same machine, you will receive this error because the MATLAB Distributed Computing Server product is not installed, and therefore the worker cannot obtain a license.
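The line-start rule for license.dat given earlier in this section (each line begins with #, SERVER, DAEMON, or INCREMENT, with no stray tabs or carriage returns) can be checked mechanically. The following is a minimal sketch, not a MathWorks tool: the function name check_license_lines is hypothetical, and the script applies the rule literally, so blank lines and any continuation lines in a particular license file are also reported and need a manual look.

```shell
#!/bin/sh
# Sketch only: check_license_lines is a hypothetical helper that applies
# the line-start rule from this section to the file named in $1.
# It prints one message per suspicious line and nothing for a clean file.
check_license_lines() {
    awk '
        # Tabs and carriage returns are the characters named in the text.
        /\t/ || /\r/ { print NR ": contains a tab or carriage return"; next }
        # Every remaining line must start with one of the four keywords.
        !/^(#|SERVER|DAEMON|INCREMENT)/ { print NR ": does not begin with #, SERVER, DAEMON, or INCREMENT" }
    ' "$1"
}

# Usage (the path is a placeholder for your own license file):
#   check_license_lines /path/to/license.dat
```

After correcting any reported lines, restart the license manager as described above.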
Memory Errors on UNIX Operating Systems
If the number of threads created by the server services on a machine running a UNIX operating system exceeds the limitation set by the maxproc value, the services fail and generate an out-of-memory error. Check your maxproc value on a UNIX operating system with the limit command. (Different versions of UNIX software might have different names for this property.)
Running Server Processes from a Windows Network Installation
Many networks are configured not to allow LocalSystem to have access to UNC or mapped network shares. In this case, run the mdce process under a different user with rights to log on as a service. See “Setting the User” on page 2-14.
Required Ports
Using a Job Manager
BASE_PORT. The mdce_def file specifies and describes the ports required by the job manager and all workers. See the following file in the MATLAB installation used for each cluster process:
matlabroot/toolbox/distcomp/bin/mdce_def.sh (on UNIX operating systems)
matlabroot\toolbox\distcomp\bin\mdce_def.bat (on Windows operating systems)
Parallel Jobs. On worker machines running a UNIX operating system, the ports required by MPICH for running parallel jobs range from BASE_PORT + 1000 to BASE_PORT + 2000.
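As a worked example of this arithmetic, the following sketch prints the first and last port of the MPICH range for a given BASE_PORT, which can be handy when writing firewall rules. The function name mpich_port_range is hypothetical, and the value 27350 is only an example; read the actual BASE_PORT from your own mdce_def file.

```shell
#!/bin/sh
# Sketch only: the function name and the example BASE_PORT value are
# assumptions; take the real value from your mdce_def file.

# Print the lowest and highest port of the range
# BASE_PORT + 1000 through BASE_PORT + 2000.
mpich_port_range() {
    base_port=$1
    echo "$((base_port + 1000)) $((base_port + 2000))"
}

# Example with an assumed BASE_PORT of 27350; prints "28350 29350".
mpich_port_range 27350
```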
Using a Third-Party Scheduler
Before the worker processes start, you can control the range of ports used by the workers for parallel jobs by defining the environment variable MPICH_PORT_RANGE with the value minport:maxport.
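On UNIX systems, the variable might be set in the shell or job script that launches the workers. The following is a minimal sketch under that assumption: the helper name set_mpich_port_range and the example ports 30000 and 30100 are hypothetical, and how the variable reaches the worker processes depends on your scheduler.

```shell
#!/bin/sh
# Sketch only: set_mpich_port_range is a hypothetical helper, and the
# example ports are placeholders for a range your site allows.

# Validate the two ports, then export MPICH_PORT_RANGE as minport:maxport.
set_mpich_port_range() {
    minport=$1
    maxport=$2
    for port in "$minport" "$maxport"; do
        case "$port" in
            ''|*[!0-9]*)
                echo "ports must be whole numbers" >&2
                return 1 ;;
        esac
    done
    if [ "$minport" -gt "$maxport" ]; then
        echo "minport must not exceed maxport" >&2
        return 1
    fi
    MPICH_PORT_RANGE="${minport}:${maxport}"
    export MPICH_PORT_RANGE
}

# Define the range before the worker processes start.
set_mpich_port_range 30000 30100 && echo "MPICH_PORT_RANGE=$MPICH_PORT_RANGE"
```

The validation is there only to catch swapped or mistyped values early; the variable itself is a plain string.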
Client Ports
With the pctconfig function, you specify the ports used by the client. If the default ports cannot be used, this function allows you to configure ports separately for communication with the job manager and communication with pmode or a MATLAB pool.
Ephemeral TCP Ports with Job Manager
If you use the job manager on a cluster of nodes running Windows operating systems, you must make sure that a large number of ephemeral TCP ports are available on the job manager machine. By default, the maximum valid ephemeral TCP port number on a Windows operating system is 5000, but transfers of large data sets might fail if this setting is not increased. In particular, if your cluster has 32 or more workers, you should increase the maximum valid ephemeral TCP port number using the following procedure:
1 Start the Registry Editor.
2 Locate the following subkey in the registry, and click Parameters:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
3 On the Registry Editor window, select Edit > New > DWORD Value.
4 In the list of entries on the right, change the new value name to MaxUserPort and press Enter.
5 Right-click on the MaxUserPort entry name and select Modify.
6 In the Edit DWORD Value dialog, enter 65534 in the Value data field. Select Decimal for the Base value. Click OK.
This parameter controls the maximum port number that is used when a program requests any available user port from the system. Typically, ephemeral (short-lived) ports are allocated between the values of 1024 and 5000 inclusive. This action allows allocation for port numbers up to 65534.
7 Quit the Registry Editor.
8 Reboot your machine.
Host Communications Problems
If a worker is not able to make a connection with its job manager, or if a client session cannot find a job manager with the findResource function, this might indicate communications problems between nodes.
Using a Command Line Interface
First, be sure that the machines in question agree on their IP resolutions. The IP address for a particular host should be the same for itself as it is from the perspective of another host. For example, if a process on hostB cannot connect to one on hostA, find out the hostA IP address for itself, then see what the IP address for hostA is from hostB. They should be the same.
If the machines can identify each other, the nodestatus command can be useful for diagnosing problems between their processes. Use the command to determine what MDCS processes are running on the local host, and which are accessible from remote hosts. If a worker on hostA cannot register with its job manager on hostB, run nodestatus on both hosts to see what each can see on hostB.
On hostB, execute:
nodestatus -remotehost hostB
Then on hostA, run exactly the same command:
nodestatus -remotehost hostB
The results should be the same, showing the same listing of job managers and workers.
If the output indicates problems, run the command again with a higher information level to receive more detailed information:
nodestatus -remotehost hostB -infolevel 3
Using a GUI
You can diagnose some communications problems using Admin Center.
If you cannot successfully add hosts to the listing by specifying host name, you can use their IP addresses instead (see “Adding Hosts” on page 3-3). If you suspect any communications problems, in the Admin Center GUI click Test Connectivity (see “Testing Connectivity” on page 3-9). This testing verifies that the nodes can identify each other and allow their processes to communicate with each other.
Verifying Multicast Communications
Note Although the current version of the parallel computing products
continues to support multicast communications between its processes, multicast is not recommended and might not be supported in future releases.
Multicast, unlike TCP/IP or UDP, is a subscription-based protocol where a number of machines on a network indicate to the network their interest in particular packets originating somewhere on that network. By contrast, both UDP and TCP packets are always bound for a single machine, usually indicated by its IP address.
The main tools for investigating this type of packet are:
tcpdump for UNIX operating systems
winpcap and ethereal for Microsoft Windows operating systems
A Java™ class included with the parallel computing products.
The Java class is called com.mathworks.toolbox.distcomp.test.MulticastTester. Both its static main method and its constructor take two input arguments: the multicast group to join and the port number to use.
This Java class has a number of simple methods to attempt to join a specified multicast group. Once the class has successfully joined the group, it has methods to send messages to the group, listen for messages from the group, and display what it receives. You can use this class both from a command-line call to Java software and inside MATLAB.
From a shell prompt (assuming that java is on your path), type
java -cp distcomp.jar com.mathworks.toolbox.distcomp.test.MulticastTester
You should see an output something like this:
0 : host1name : 0
1 : host2name : 0
The following example shows how to use the Java class inside MATLAB.
Start MATLAB on two machines (e.g., host1name and host2name) for which you want to test multicast. In each MATLAB session, enter the following commands:
m = com.mathworks.toolbox.distcomp.test.MulticastTester('239.1.1.1', 9999);
m.startSendingThread;
m.startListeningThread;
These instructions cause each MATLAB session to issue a stream of multicast test packets, and to listen for test packets. If multicast is working between the machines, you see a stream of lines like the following:
0 : host1name : 0
1 : host2name : 0
2 : host2name : 1
3 : host2name : 2
The number on the left in each string is the line number for the received packet. The text in the center is the host from which the packet is received. The number on the right is the packet number sent by the sending host. It is normal for a host to report a test packet from itself.
If either machine does not receive a stream of test packets, or if the remote host is not included in either stream, then multicast communication is not operating properly.
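When the stream is long, it can help to summarize which hosts actually appear in it. The sketch below assumes that you have copied the received lines into a text file or pipe, in the "line : host : packet" layout shown above. The helper name hosts_seen is hypothetical and is not part of the parallel computing products.

```shell
#!/bin/sh
# Sketch only: hosts_seen is a hypothetical helper. It reads test output
# lines of the form "<line> : <host> : <packet>" on standard input and
# prints each sending host once, sorted by name.
hosts_seen() {
    awk -F ' : ' 'NF == 3 { print $2 }' | sort -u
}

# Example using the sample lines from this section.
printf '%s\n' \
    '0 : host1name : 0' \
    '1 : host2name : 0' \
    '2 : host2name : 1' \
    '3 : host2name : 2' | hosts_seen
```

If the summary from one machine lists only that machine's own name, packets from the remote host are not arriving there.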
To terminate the test stream, execute the following in both MATLAB sessions:
m.stopSendingThread;
m.stopListeningThread;
Admin Center
“Starting Admin Center” on page 3-2
“Setting Up Resources” on page 3-3
“Testing Connectivity” on page 3-9
“Saving and Loading Sessions” on page 3-13
“Preparing for User Configurations” on page 3-14
Starting Admin Center
Admin Center is a graphical user interface that lets you control and verify MATLAB Distributed Computing Server resources if you are using a job manager as your scheduler.
You must start Admin Center outside a MATLAB session by executing the following:
matlabroot/toolbox/distcomp/bin/admincenter (on UNIX operating systems)
matlabroot\toolbox\distcomp\bin\admincenter.bat (on Microsoft Windows operating systems)
The first time you start Admin Center, you see a welcome dialog box.
A new session has no hosts listed, so the usual first step is to identify the hosts you want to include in your listing. To do this, click Add or Find. Further information continues in the next section.
If you start Admin Center again on the same host, your previous session for that machine is loaded; and unless the update rate is set to never, Admin Center performs an update immediately for the listed hosts and processes. To clear this information and start a new session, select the pull-down File > New Session.
Setting Up Resources
In this section...
“Adding Hosts” on page 3-3
“Starting a Job Manager” on page 3-4
“Starting Workers” on page 3-5
“Stopping, Destroying, Resuming, Restarting Processes” on page 3-7
“Moving a Worker” on page 3-8
“Updating the Display” on page 3-8
Adding Hosts
To specify the hosts you want displayed in Admin Center, click Add or Find in the Welcome dialog box, or if this is not a new session, click Add or Find in the Hosts module.
In the Add or Find Hosts dialog box, identify the hosts you want to add to the listing by one of the following methods:
Select Enter Hostnames and provide short host names, fully qualified domain names, or individual IP addresses for the hosts, or
Select Enter IP Range and provide the range of IP addresses for your hosts.
Note While you can add any hosts to Admin Center, a host must be running the mdce service if a job manager or worker is to run on that host. See the installation instructions available at:
http://www.mathworks.com/support/product/DM/installation/ver_current/
If one of the hosts you have specified is running a job manager, Admin Center will automatically find and list all the hosts running workers registered with that job manager. Similarly, if you specify a host that is running a worker, Admin Center will find and list the host running that worker’s job manager, and also all hosts running other workers under that job manager.
Starting a Job Manager
To start a job manager, click Start in the Job Manager module.
In the New Job Manager dialog box, provide a name for the job manager, and select a host to run it on.
Alternative methods for starting a job manager include selecting the pull-down Job Manager > Start, or right-clicking a listed host and selecting Start Job Manager.
With a job manager running on your cluster, Admin Center might look like the following figure, with the job manager listed in the Job Manager module, as well as being listed by name in the Hosts module in the line for the host on which it is running.
Starting Workers
To start MATLAB workers, click Start in the Workers module.
In the Start Workers dialog box, specify the numbers of workers to start on each host, and select the hosts to run them. From the list, select the job manager for these workers. Click OK to start the workers. Admin Center automatically provides names for the workers, based on the hosts running them.
Alternative methods for starting workers include selecting the pull-down Workers > Start, or right-clicking a listed host or job manager and selecting Start Workers.
With workers running on your cluster, Admin Center might look like the following figure, which shows the workers listed in the Workers module. Also, the number of workers running under the job manager is listed in the Job Manager module, and the number of workers for each job manager is listed in the Hosts module.
To get more information on any host, job manager, or worker listed in Admin Center, right-click its name in the display and select Properties. Alternatively, you can find the Properties option under the Hosts, Job Manager, and Workers drop-down menus.
Stopping, Destroying, Resuming, Restarting Processes
You can Stop or Destroy job managers and workers. The primary difference is that stopping a process shuts it down but retains its data; destroying a process shuts it down and clears its data. Use Resume to have a process continue with its existing data. When you use Restart, a dialog box requires you to confirm your intention of starting a new process while keeping or discarding data.
Moving a Worker
To move a worker from one host to another, you must completely shut it down, then start a new worker on the desired host:
1 Right-click the worker in the Workers module list.
2 Select Destroy. This shuts down the worker process and removes all its data.
3 If the old worker host is not running any other MDCS processes (mdce service, job manager, or workers), you might want to remove it from the Admin Center listing.
4 If necessary, add the new host to the Admin Center host listing.
5 In the Workers module, click Start. Select the desired host in the Start Workers dialog box, along with the appropriate number and job manager name.
Use a similar process to move a job manager from one host to another. Note, however, that all workers registered with the job manager must be destroyed and started again, registering them with the new instance of the job manager.
Updating the Display
Admin Center updates its data automatically at regular intervals. To set the update rate, select an option from the Update list. Click Update Now to immediately update the display data.
Testing Connectivity
Admin Center lets you test communications between your job manager node, worker nodes, and the node where Admin Center is running.
The tests are divided into four categories:
Client: Verifies that the node running Admin Center is properly configured so that further cluster testing can proceed.
Client to Nodes: Verifies that the node running Admin Center can identify and communicate with the other nodes in the cluster.
Nodes to Nodes: Verifies that the other nodes in the cluster can identify each other, and that each node allows its mdce service to communicate with the mdce service on the other cluster nodes.
Nodes to Client: Verifies that other cluster nodes can identify and communicate with the node running Admin Center.
First click Test Connectivity to open the Connectivity Testing dialog box. By default, the dialog box displays the results of the last test. To run new tests and update the display, click Run.
During test execution, Admin Center displays this progress dialog box.
When the tests are complete, the Running Tests dialog box automatically closes, and Admin Center displays the test results in the Connectivity Testing dialog box.
The possible test result symbols are described in the following table.
Test Result Description
Test passed.
Test passed, extra information is available.
Test passed, but generated a warning.
Test failed.
Test was skipped, possibly because prerequisite tests did not pass.
Tests that include failures or other results might look like the following figure.
Double-click any of the symbols in the test results to drill down for more detail. Use the Log tab to see the raw data from the tests.
The results of the tests that run on only the client are displayed in the lower-left corner of the dialog box. To drill into client-only test results, click More Info.
Saving and Loading Sessions
By default, Admin Center saves the cluster definition, process status, and test results, so the next time the same user runs Admin Center on the same machine, that information is available and displayed by default. You can export session data so that a different user or a different host can access it, by selecting the pull-down File > Export. Browse to the location where you want to store the session data and provide a name for the file. Admin Center applies the extension .mdcs to the file name.
You can import that saved session data into a subsequent session of Admin Center by selecting the pull-down File > Import. The imported data includes cluster definition and test results.
When identifying the file for importing in the Import Session dialog box, there is a Disable updates check box. Checking this box lets you import a session that does not automatically update, so that you can statically examine a cluster setup for evaluation or diagnostic purposes. Otherwise, unless the update rate is set to never, Admin Center performs an update immediately after starting or loading a session.
Preparing for User Configurations
Admin Center does not create user configurations, but the information displayed in Admin Center is of vital importance when you create your parallel configuration. This information includes the job manager name, job manager host, and number of workers. For more information about creating and using configurations, see “Programming with User Configurations” in the Parallel Computing Toolbox documentation.
Control Script Reference
mdce Process Control (p. 4-2)    Control mdce service
Job Manager Control (p. 4-2)    Control job manager
Worker Control (p. 4-2)    Control MATLAB workers
mdce Process Control
mdce    Install, start, stop, or uninstall mdce service
nodestatus    Status of mdce processes running on node
remotemdce    Execute mdce command on one or more remote hosts by transport protocol
Job Manager Control
startjobmanager    Start job manager process
stopjobmanager    Stop job manager process
Worker Control
startworker    Start MATLAB worker session
stopworker    Stop MATLAB worker session
Control Scripts — Alphabetical List
mdce
Purpose Install, start, stop, or uninstall mdce service
Syntax mdce install
mdce uninstall
mdce start
mdce stop
mdce console
mdce restart
mdce ... -mdcedef <mdce_defaults_file>
mdce ... -clean
mdce status
mdce -version
Description The mdce service ensures that all other processes are running and that it is possible to communicate with them. Once the mdce service is running, you can use the nodestatus command to obtain information about the mdce service and all the processes it maintains.
The mdce executable resides in the folder matlabroot\toolbox\distcomp\bin (Windows operating system) or matlabroot/toolbox/distcomp/bin (UNIX operating system). Enter the following commands at a DOS or UNIX command-line prompt, respectively.
mdce install installs the mdce service in the Microsoft Windows Service Control Manager. This causes the service to automatically start when the Windows operating system boots up. The service must be installed before it is started.
mdce uninstall uninstalls the mdce service from the Windows Service Control Manager. Note that if you wish to install mdce service as a different user, you must first uninstall the service and then reinstall as the new user.
mdce start starts the mdce service. This creates the required logging and checkpointing directories, and then starts the service as specified in the mdce defaults file.
mdce stop stops running the mdce service. This automatically stops all job managers and workers on the computer, but leaves their checkpoint information intact so that they will start again when the mdce service is started again.
mdce console starts the mdce service as a process in the current terminal or command window rather than as a service running in the background.
mdce restart performs the equivalent of mdce stop followed by mdce start. This command is available only on UNIX and Macintosh operating systems.
mdce ... -mdcedef <mdce_defaults_file> uses the specified alternative mdce defaults file instead of the one found in matlabroot/toolbox/distcomp/bin.
mdce ... -clean performs a complete cleanup of all service checkpoint and log files before installing or starting the service, or after stopping or uninstalling it. This deletes all information about any job managers or workers this service has ever maintained.
mdce status reports the status of the mdce service, indicating whether it is running and with what PID. Use nodestatus to obtain more detailed information about the mdce service. The mdce status command is available only on UNIX and Macintosh operating systems.
mdce -version prints version information of the mdce process to standard output, then exits.
See Also nodestatus, startjobmanager, startworker, stopjobmanager, stopworker
nodestatus
Purpose Status of mdce processes running on node
Syntax nodestatus
nodestatus -flags
Description nodestatus displays the status of the mdce service and the processes
which it maintains. The mdce service must already be running on the specified computer.
The
nodestatus executable resides in the folder
matlabroot\toolbox\distcomp\bin (Windows operating
system) or system). Enter the following command syntax at a DOS or UNIX command-line prompt, respectively.
nodestatus -flags accepts the following input flags. Multiple flags
can be used together on the same command.
matlabroot/toolbox/distcomp/bin (UNIX operating
Flag
    Operation
-remotehost <hostname>
    Displays the status of the mdce service and the processes it maintains on the specified host. The default value is the local host.
-infolevel <level>
    Specifies how much status information to report, using a level of 1-3. 1 means only the basic information, 3 means all information available. The default value is 1.
-baseport <port_number>
    Specifies the base port that the mdce service on the remote host is using. You need to specify this only if the value of BASE_PORT in the local mdce_def file does not match the base port being used by the mdce service on the remote host.
-v
    Verbose mode displays the progress of the command execution.
Examples Display basic information about the mdce processes on the local host.
nodestatus
Display detailed information about the status of the mdce processes on host node27.
nodestatus -remotehost node27 -infolevel 2
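To check several nodes in one pass, the second example can be repeated per host. The sketch below is an illustration only: it prints one nodestatus command per host rather than running it, and it rejects an information level outside the documented range of 1-3. The function name nodestatus_cmds and the host node28 are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one nodestatus command per host.
# Usage: nodestatus_cmds LEVEL HOST [HOST ...]
nodestatus_cmds() {
    level=$1
    shift
    # -infolevel accepts only the levels 1, 2, or 3.
    case $level in
        1|2|3) ;;
        *) echo "infolevel must be 1, 2, or 3" >&2; return 1 ;;
    esac
    for host in "$@"; do
        echo "nodestatus -remotehost $host -infolevel $level"
    done
}

# node28 is an assumed second host, used here only for illustration.
nodestatus_cmds 2 node27 node28
```

The printed lines can be pasted into a terminal or piped to sh once they look correct.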
See Also mdce, startjobmanager, startworker, stopjobmanager, stopworker
remotemdce
Purpose Execute mdce command on one or more remote hosts by transport
protocol
Syntax remotemdce <mdce options> <flags> <protocol options>
Description remotemdce allows you to execute the mdce service on one or more remote hosts. For a description of the mdce service, see the mdce reference page. The general form of the syntax is:
remotemdce <mdce options> <flags> <protocol options>
The following table describes the supported flags and options. They can be combined in the same command. Note that flags are each preceded by a dash (-).
Flags and Options
    Operation
<mdce options>
    Options and arguments of the mdce command, such as start, stop, etc. See the mdce reference page for a full list.
-matlabroot MATLABROOT_DIR
    The MATLAB installation folder on the remote hosts, required only if the remote installation folder differs from the one on the local machine.
-remotehost HOST1[,HOST2[,...]]
    Specify the names of the hosts where you want to run the mdce command. Separate the host names by commas without any white space. This is a mandatory argument.
-remoteplatform { UNIX | WINDOWS }
    Indicate the platform of the remote hosts. This option is required only if different from the local platform.
-quiet
    Prevent mdce from prompting the user for missing information. The command fails if any required information is not specified.
-help
    Print the help information.
-protocol type
    Force the usage of a particular protocol type. Specifying a protocol type with all its required parameters also avoids interactive prompting and allows for use in scripts.
    The supported protocol types are ssh and rsh.
    To get more information about one particular protocol type, enter
    remotemdce -protocol type -help
<protocol options>
    Specify particular options for the protocol type being used.
Note If you are using OpenSSHd on a Microsoft Windows operating system, you can encounter a problem when using backslashes in path names for your command options. In most cases, you can work around this problem by using forward slashes instead. For example, to specify the file C:\temp\mdce_def.bat, you should identify it as C:/temp/mdce_def.bat.
Examples Start mdce on three remote machines of the same platform as the client:
remotemdce start -remotehost hostA,hostB,hostC
Start mdce in a clean state on two UNIX operating system machines from a Windows operating system machine, using the ssh protocol. Enter the following command on a single line:
remotemdce start -clean -matlabroot /usr/local/matlab -remotehost unixHost1,unixHost2 -remoteplatform UNIX -protocol ssh
See Also mdce
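Because -remotehost requires host names separated by commas with no white space, a host list kept as separate words has to be joined before it is passed to remotemdce. The following is a minimal sketch of that step; join_hosts is a hypothetical helper name, and the command is printed for review rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: join host names with commas and no white space,
# which is the form that the -remotehost flag expects.
join_hosts() {
    list=""
    for host in "$@"; do
        if [ -z "$list" ]; then
            list=$host
        else
            list="$list,$host"
        fi
    done
    echo "$list"
}

# Print the command instead of running it so it can be checked first.
echo "remotemdce start -remotehost $(join_hosts hostA hostB hostC)"
```

The same helper could feed a host list read from a file, provided each name reaches join_hosts as its own argument.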
startjobmanager
Purpose Start job manager process
Syntax startjobmanager
startjobmanager -flags
Description startjobmanager starts a job manager process and the associated
job manager lookup process under the mdce service, which maintains them after that. The job manager handles the storage of jobs and the distribution of tasks contained in jobs to MATLAB workers that are registered with it. The mdce service must already be running on the specified computer.
The startjobmanager executable resides in the folder matlabroot\toolbox\distcomp\bin (Windows operating system) or matlabroot/toolbox/distcomp/bin (UNIX operating system). Enter the following command syntax at a DOS or UNIX command-line prompt, respectively.
startjobmanager -flags accepts the following input flags. Multiple flags can be used together on the same command.
Flag
    Operation
-name <job_manager_name>
    Specifies the name of the job manager. This identifies the job manager to MATLAB worker sessions and MATLAB clients. The default is the value of the DEFAULT_JOB_MANAGER_NAME parameter in the mdce_def file.
-remotehost <hostname>
    Specifies the name of the host where you want to start the job manager and the job manager lookup process. If omitted, they are started on the local host.
-clean
    Deletes all checkpoint information stored on disk from previous instances of this job manager before starting. This cleans the job manager so that it initializes with no jobs or tasks.
-multicast
    Overrides the use of unicast to contact the job manager lookup process. It is recommended that you not use -multicast unless you are certain that multicast works on your network. This overrides the setting of JOB_MANAGER_HOST in the mdce_def file on the remote host, which would have the job manager use unicast. If this flag is omitted and JOB_MANAGER_HOST is empty, the job manager uses unicast to contact the job manager lookup process running on the same host.
-baseport <port_number>
    Specifies the base port that the mdce service on the remote host is using. You need to specify this only if the value of BASE_PORT in the local mdce_def file does not match the base port being used by the mdce service on the remote host.
-v
    Verbose mode displays the progress of the command execution.
Examples Start the job manager MyJobManager on the local host.
startjobmanager -name MyJobManager
Start the job manager MyJobManager on the host JMHost.
startjobmanager -name MyJobManager -remotehost JMHost
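Since the job manager runs under the mdce service, one way to confirm that it started is to follow startjobmanager with nodestatus for the same host. The sketch below only prints that pair of commands; jobmanager_plan is a hypothetical helper name and nothing is executed.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that start a job manager and then
# report the status of the node it runs on. Nothing is executed here.
# Usage: jobmanager_plan NAME [HOST]
jobmanager_plan() {
    name=$1
    host=${2:-}
    if [ -n "$host" ]; then
        echo "startjobmanager -name $name -remotehost $host"
        echo "nodestatus -remotehost $host"
    else
        # With no host given, both commands act on the local host.
        echo "startjobmanager -name $name"
        echo "nodestatus"
    fi
}

jobmanager_plan MyJobManager JMHost
```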
See Also mdce, nodestatus, startworker, stopjobmanager, stopworker
startworker
Purpose Start MATLAB worker session
Syntax startworker
startworker -flags
Description startworker starts a MATLAB worker process under the mdce service, which maintains it after that. The worker registers with the specified job manager, from which it will get tasks for evaluation. The mdce service must already be running on the specified computer.
The startworker executable resides in the folder matlabroot\toolbox\distcomp\bin (Windows operating system) or matlabroot/toolbox/distcomp/bin (UNIX operating system). Enter the following command syntax at a DOS or UNIX command-line prompt, respectively.
startworker -flags accepts the following input flags. Multiple flags can be used together on the same command, except where noted.
Flag
    Operation
-name <worker_name>
    Specifies the name of the MATLAB worker. The default is the value of the DEFAULT_WORKER_NAME parameter in the mdce_def file.
-remotehost <hostname>
    Specifies the name of the computer where you want to start the MATLAB worker. If omitted, the worker is started on the local computer.
-jobmanager <job_manager_name>
    Specifies the name of the job manager this MATLAB worker will receive tasks from. The default is the value of the DEFAULT_JOB_MANAGER_NAME parameter in the mdce_def file.
-jobmanagerhost <job_manager_hostname>
    Specifies the host on which the job manager is running. The worker uses unicast to contact the job manager lookup process on that host to register with the job manager.
    This overrides the setting of JOB_MANAGER_HOST in the mdce_def file on the worker computer, which would also have the worker use unicast.
    Cannot be used together with -multicast.
-multicast
    If you are certain that multicast works on your network, you can force the worker to use multicast to locate the job manager lookup process by specifying -multicast.
    Note: If you are using this flag to change the settings of and restart a stopped worker, then you should also use the -clean flag.
    Cannot be used together with -jobmanagerhost.
-clean
    Deletes all checkpoint information associated with this worker name before starting.
-baseport <port_number>
    Specifies the base port that the mdce service on the remote host is using. You only need to specify this if the value of BASE_PORT in the local mdce_def file does not match the base port being used by the mdce service on the remote host.
-v
    Verbose mode displays the progress of the command execution.
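The table states that -jobmanagerhost and -multicast cannot be used together. A wrapper script could check for that combination before calling startworker. The following is a minimal sketch of such a check; check_startworker_flags is a hypothetical name, and the function inspects the arguments only, without starting anything.

```shell
#!/bin/sh
# Hypothetical pre-check: reject argument lists that combine
# -jobmanagerhost and -multicast, which the flag table disallows.
check_startworker_flags() {
    has_host=0
    has_multicast=0
    for arg in "$@"; do
        case $arg in
            -jobmanagerhost) has_host=1 ;;
            -multicast) has_multicast=1 ;;
        esac
    done
    if [ "$has_host" -eq 1 ] && [ "$has_multicast" -eq 1 ]; then
        echo "-jobmanagerhost and -multicast cannot be used together" >&2
        return 1
    fi
    return 0
}

# Example: a valid combination prints a confirmation.
check_startworker_flags -jobmanager MyJobManager -jobmanagerhost JMHost && echo "flags ok"
```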
Examples Start a worker on the local host, using the default worker name, registering with the job manager MyJobManager on the host JMHost.
startworker -jobmanager MyJobManager -jobmanagerhost JMHost
Start a worker on the host WorkerHost, using the default worker name, and registering with the job manager MyJobManager on the host JMHost. (The following command should be entered on a single line.)
startworker -jobmanager MyJobManager -jobmanagerhost JMHost -remotehost WorkerHost
Start two workers, named worker1 and worker2, on the host WorkerHost, registering with the job manager MyJobManager that is running on the host JMHost. Note that to start two workers on the same computer, you must give them different names. (Each of the two commands below should be entered on a single line.)
startworker -name worker1 -remotehost WorkerHost -jobmanager MyJobManager -jobmanagerhost JMHost
startworker -name worker2 -remotehost WorkerHost -jobmanager MyJobManager -jobmanagerhost JMHost
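When more than two workers are started on one computer, the requirement that each worker has a different name can be met by numbering them. The sketch below generalizes the last example under that assumption; startworker_cmds is a hypothetical helper name, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: print startworker commands for COUNT workers on one
# host, giving each worker a distinct name (worker1, worker2, ...).
# Usage: startworker_cmds COUNT WORKER_HOST JOB_MANAGER JOB_MANAGER_HOST
startworker_cmds() {
    count=$1
    worker_host=$2
    jm_name=$3
    jm_host=$4
    i=1
    while [ "$i" -le "$count" ]; do
        echo "startworker -name worker$i -remotehost $worker_host -jobmanager $jm_name -jobmanagerhost $jm_host"
        i=$((i + 1))
    done
}

startworker_cmds 2 WorkerHost MyJobManager JMHost
```

With a count of 2, the printed lines match the two commands in the example above.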
See Also mdce, nodestatus, startjobmanager, stopjobmanager, stopworker
stopjobmanager
Purpose Stop job manager process
Syntax stopjobmanager
stopjobmanager -flags
Description stopjobmanager stops a job manager that is running under the mdce
service.
The stopjobmanager executable resides in the folder matlabroot\toolbox\distcomp\bin (Windows operating system) or matlabroot/toolbox/distcomp/bin (UNIX operating system). Enter the following command syntax at a DOS or UNIX command-line prompt, respectively.
stopjobmanager -flags accepts the following input flags. Multiple flags can be used together on the same command.
Flag
    Operation
-name <job_manager_name>
    Specifies the name of the job manager to stop. The default is the value of the DEFAULT_JOB_MANAGER_NAME parameter in the mdce_def file.
-remotehost <hostname>
    Specifies the name of the host where you want to stop the job manager and the associated job manager lookup process. The default value is the local host.
-clean
    Deletes all checkpoint information stored on disk for the current instance of this job manager after stopping it. This cleans the job manager of all its job and task data.
-baseport <port_number>
    Specifies the base port that the mdce service on the remote host is using. You need to specify this only if the value of BASE_PORT in the local mdce_def file does not match the base port being used by the mdce service on the remote host.
-v
    Verbose mode displays the progress of the command execution.
Examples Stop the job manager MyJobManager on the local host.
stopjobmanager -name MyJobManager
Stop the job manager MyJobManager on the host JMHost.
stopjobmanager -name MyJobManager -remotehost JMHost
See Also mdce, nodestatus, startjobmanager, startworker, stopworker
stopworker
Purpose Stop MATLAB worker session
Syntax stopworker
stopworker -flags
Description stopworker stops a MATLAB worker process that is running under the mdce service.
The stopworker executable resides in the folder matlabroot\toolbox\distcomp\bin (Windows operating system) or matlabroot/toolbox/distcomp/bin (UNIX operating system). Enter the following command syntax at a DOS or UNIX command-line prompt, respectively.
stopworker -flags accepts the following input flags. Multiple flags can be used together on the same command.
Flag
    Operation
-name <worker_name>
    Specifies the name of the MATLAB worker to stop. The default is the value of the DEFAULT_WORKER_NAME parameter in the mdce_def file.
-remotehost <hostname>
    Specifies the name of the host where you want to stop the MATLAB worker. The default value is the local host.
-clean
    Deletes all checkpoint information associated with this worker name after stopping it.
-baseport <port_number>
    Specifies the base port that the mdce service on the remote host is using. You need to specify this only if the value of BASE_PORT in the local mdce_def file does not match the base port being used by the mdce service on the remote host.
-v
    Verbose mode displays the progress of the command execution.
Examples Stop the worker with the default name on the local host.
stopworker
Stop the worker with the default name, running on the computer WorkerHost.
stopworker -remotehost WorkerHost
Stop the workers named worker1 and worker2, running on the computer WorkerHost.
stopworker -name worker1 -remotehost WorkerHost
stopworker -name worker2 -remotehost WorkerHost
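Stopping several named workers on one host repeats the same command with a different -name value each time. The following is a minimal sketch of that loop; stopworker_cmds is a hypothetical helper name, and it prints the commands for review instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: print one stopworker command per worker name.
# Usage: stopworker_cmds HOST NAME [NAME ...]
stopworker_cmds() {
    host=$1
    shift
    for name in "$@"; do
        echo "stopworker -name $name -remotehost $host"
    done
}

stopworker_cmds WorkerHost worker1 worker2
```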
See Also mdce, nodestatus, startjobmanager, startworker, stopjobmanager
Glossary
CHECKPOINTBASE
The name of the parameter in the mdce_def file that defines the location of the job manager and worker checkpoint directories.
checkpoint directory
Location where job manager checkpoint information and worker checkpoint information is stored.
client
The MATLAB session that defines and submits the job. This is the MATLAB session in which the programmer usually develops and prototypes applications. Also known as the MATLAB client.
client computer
The computer running the MATLAB client.
cluster
A collection of computers that are connected via a network and intended for a common purpose.
coarse-grained application
An application for which run time is significantly greater than the communication time needed to start and stop the program. Coarse-grained distributed applications are also called embarrassingly parallel applications.
codistributed array
An array partitioned into segments, with each segment residing in the workspace of a different lab.
Composite
An object in a MATLAB client session that provides access to data values stored on the labs in a MATLAB pool, such as the values of variables that are assigned inside an spmd statement.
computer
A system with one or more processors.
distributed application
The same application that runs independently on several nodes, possibly with different input parameters. There is no communication, shared data, or synchronization points between the nodes. Distributed applications can be either coarse-grained or fine-grained.
distributed computing
Computing with distributed applications, running the application on several nodes simultaneously.
distributed computing demos
Demonstration programs that use Parallel Computing Toolbox software, as opposed to sequential demos.
DNS
Domain Name System. A system that translates Internet domain names into IP addresses.
dynamic licensing
The ability of a MATLAB worker or lab to employ all the functionality you are licensed for in the MATLAB client, while checking out only a server product license. When a job is created in the MATLAB client with Parallel Computing Toolbox software, the products for which the client is licensed will be available for all workers or labs that evaluate tasks for that job. This allows you to run any code on the cluster for which you are licensed on your MATLAB client, without requiring extra licenses for the worker beyond that for the MATLAB Distributed Computing Server product. For a list of products that are not eligible for use with Parallel Computing Toolbox software, see
http://www.mathworks.com/products/ineligible_programs/.
fine-grained application
An application for which run time is significantly less than the communication time needed to start and stop the program. Compare to coarse-grained applications.
head node
Usually, the node of the cluster designated for running the job manager and license manager. It is often useful to run all the nonworker-related processes on a single machine.
heterogeneous cluster
A cluster that is not homogeneous.
homogeneous cluster
A cluster of identical machines, in terms of both hardware and software.
job
The complete large-scale operation to perform in MATLAB, composed of a set of tasks.
job manager
The MathWorks process that queues jobs and assigns tasks to workers. A third-party process that performs this function is called a scheduler. The general term “scheduler” can also refer to a job manager.
job manager checkpoint information
Snapshot of information necessary for the job manager to recover from a system crash or reboot.
job manager database
The database that the job manager uses to store the information about its jobs and tasks.
job manager lookup process
The process that allows clients, workers, and job managers to find each other. It starts automatically when the job manager starts.
lab
When workers start, they work independently by default. They can then connect to each other and work together as peers, and are then referred to as labs.
LOGDIR
The name of the parameter in the mdce_def file that defines the directory where logs are stored.
MathWorks job manager
See job manager.
MATLAB client
See client.
MATLAB pool
A collection of labs that are reserved by the client for execution of
parfor-loops or spmd statements. See also lab.
MATLAB worker
See worker.
mdce
The service that has to run on all machines before they can run a job manager or worker. This is the server foundation process, making sure that the job manager and worker processes that it controls are always running.
Note that the program and service name is all lowercase letters.
mdce_def file
The file that defines all the defaults for the mdce processes by allowing you to set preferences or definitions in the form of parameter values.
MPI
Message Passing Interface, the means by which labs communicate with each other while running tasks in the same job.
node
A computer that is part of a cluster.
parallel application
The same application that runs on several labs simultaneously, with communication, shared data, or synchronization points between the labs.
private array
An array which resides in the workspaces of one or more, but perhaps not all labs. There might or might not be a relationship between the values of these arrays among the labs.
random port
A random unprivileged TCP port, i.e., a random TCP port above 1024.
register a worker
The action that happens when both the worker and the job manager are started and the worker contacts the job manager.
replicated array
An array which resides in the workspaces of all labs, and whose size and content are identical on all labs.
scheduler
The process, either third-party or the MathWorks job manager, that queues jobs and assigns tasks to workers.
spmd (single program multiple data)
A block of code that executes simultaneously on multiple labs in a MATLAB pool. Each lab can operate on a different data set or different portion of distributed data, and can communicate with other participating labs while performing the parallel computations.
task
One segment of a job to be evaluated by a worker.
variant array
An array which resides in the workspaces of all labs, but whose content differs on these labs.
worker
The MATLAB process that performs the task computations. Also known as the MATLAB worker or worker process.
worker checkpoint information
Files required by the worker during the execution of tasks.