Managing Memory in Cray OpenSHMEMX
A subset of the SHMEM environment variables deals specifically with how
memory is allocated for use by SHMEM routines. This section describes
how to use these environment variables.
**Support for these environment variables is available only on x86_64-based
Cray systems, not on AArch64 systems.**
SHMEM classifies data objects as either local or remote, based on
whether the data object being referenced is owned by the PE (that is,
local to the PE) or owned by another PE (that is, remote to the PE).
It also classifies data objects based on the memory region in which
the object resides: either the data segment, the private heap, the
symmetric heap, or the stack. As arguments to SHMEM routines, local
objects must reside entirely within one of the four memory regions,
while remote objects must reside entirely within either the data
segment or symmetric heap.
To examine SHMEM memory usage, set the SHMEM_MEMINFO_DISPLAY
environment variable to a value of 1 or greater. Doing so causes SHMEM
initialization to produce a set of messages similar to this example.
LIBSMA INFO:
min PEs per node = 24 on nid 62
max PEs per node = 24 on nid 62
min nominal node size = 32768M = 32G on nid 62
max nominal node size = 32768M = 32G on nid 62
min boot_freemem = 32032M = 31G on nid 65
max boot_freemem = 32035M = 31G on nid 56
min initial_freemem = 31783M = 31G on nid 62
max initial_freemem = 31791M = 31G on nid 7
min current_freemem = 7532M = 7G on nid 62
max current_freemem = 7539M = 7G on nid 7
huge page size = 2048K
huge pages reserved = 12000 = 24000M = 23G
min huge_page_freemem = 6416M = 6G on nid 62
max huge_page_freemem = 6708M = 6G on nid 6
min huge pages alloc = 12125 = 24250M = 23G on nid 62
max huge pages alloc = 12125 = 24250M = 23G on nid 62
-----------------------------------------------------------
memory size (decimal MiB)
region virtual address range per proc per node
------- ------------------------------ -------- --------
text 0x000000400000..0x000000501fe8 1M 24M
data 0x000000703000..0x000000745b68 0M 6M
bss 0x000000745b68..0x000006b6d190 100M 2403M
privheap 0x000006b6d190..0x000006b90000 0M 3M
symheap 0x030000000000..0x030038400000 900M 21600M
alltoall 0x0300384fda40..0x0300388fda40 4M 96M
team 0x0300388fdac0..0x0300388fdca0 0M 0M
stack 0x7ffffffe8000..0x7ffffffff000 0M 2M
--total 1005M 24134M
OS 735M
--total 24869M = 24G
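A minimal way to produce this display is a program like the following sketch; the build and launch commands in the comment are illustrative, since compiler wrappers and launcher options vary by system.

```c
/* meminfo.c - trigger the LIBSMA memory display at startup.
 * Example (illustrative) workflow:
 *   cc meminfo.c -o meminfo
 *   export SHMEM_MEMINFO_DISPLAY=1
 *   aprun -n 48 -N 24 ./meminfo
 */
#include <shmem.h>

int main(void)
{
    shmem_init();    /* the display is printed during initialization */
    shmem_finalize();
    return 0;
}
```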
Understanding the LIBSMA INFO
nominal node size
The nominal node size is the total physical memory on the
node, rounded to the nearest GiB. Since different compute
notes in the system may have different amounts of physical
memory, the minimum and maximum sizes across all nodes
allocated to your job are shown. If these values differ from
what you expected, examine your job submission resource
settings.
PEs per node
The number of PEs per node is the active number of processes
allocated to the node, not necessarily the number of
physical cores on the node. Variations in the number of PEs
per node normally indicate either that the total number of PEs
allocated for the job is not an integer multiple of the
requested number of PEs per node, leaving at least one node
with fewer active PEs, or that a heterogeneous set of nodes
was allocated to the job, with some nodes having fewer cores
than others. This does not cause a
problem for SHMEM memory allocation because fewer PEs per
node means more, not less, memory is available to each of
those PEs. The size of the allocated regions is still the
same across all PEs in the job, but is limited to the
minimum available memory across all PEs.
boot_freemem
This is the kernel-determined value of the amount of memory
available to programs on the node at boot time, before any
programs have run. This value remains unchanged while the
system is running. The value will vary at least slightly
from node to node. Large variations between the minimum and
maximum value may indicate a problem.
initial_freemem
This is the kernel-determined value of the amount of memory
that was free at the beginning of the startup of the current
job. The difference between boot_freemem and initial_freemem
is a measure of how much memory is not available to the
application, either because of processes from previous jobs
that for some reason are still running on the node or
possibly because of memory leaks on the node. One qualifier:
from run to run, the kernel may temporarily hold onto some
memory that it later frees while the job is running.
Significant differences (1 GB or more) probably indicate a
serious problem on that node.
current_freemem
This is the amount of memory that was free at the
completion of SHMEM initialization. This value is based
on the kernel-determined value of /proc/current_freemem at
that time plus the amount of memory already allocated for
text, data, bss, symmetric heap, private heap, and stack.
Memory reserved (by the aprun -m option) but not allocated
for any of the above memory regions is still considered free
and is included in the current_freemem value. This memory is
available later in the application for growth of the
private heap or for stack variables. The value can be
expected to vary slightly from node to node.
huge_page_size
The size in bytes of huge pages for those memory regions
backed by huge pages.
huge_pages_reserved
The number of huge pages reserved and the size in bytes of
the memory backed by huge pages. The usual method for
reserving huge pages is by using the aprun -m size[h|hs]
parameter. See the aprun man page for more information.
huge_page_freemem
This value is the amount of free memory in large-enough
blocks to support the size of the huge pages. This value
takes into account that memory can get fragmented and that
the total amount of free memory in large-enough blocks may
be less than the total amount of free memory. This value
includes huge pages that have been reserved but not yet
allocated, so this is a critical value for determining how
many huge pages can be allocated. The difference between
current_freemem and huge_page_freemem is a measure of how
much memory is fragmented.
huge_pages_alloc
This gives both the number of huge pages allocated and the
corresponding amount of memory, displayed in mebibytes.
Pages may have been reserved but
not yet allocated, and because CLE supports dynamic
allocation of huge pages, the amount allocated may be more
than the amount reserved.
Note that the SHMEM symmetric heap is always backed by huge
pages and the full XT_SYMMETRIC_HEAP_SIZE amount is
considered allocated during SHMEM initialization. Therefore
memory allocated for the symmetric heap is no longer free in
the context of current_freemem or huge_page_freemem, but is
only available through shmalloc() calls.
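For example, here is a minimal sketch of allocating from the symmetric heap; shmem_malloc() is the standardized spelling of the legacy shmalloc(), and the 1 MiB size is arbitrary.

```c
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();

    /* Collective call: every PE requests the same size, and the space
       comes out of the region sized by XT_SYMMETRIC_HEAP_SIZE. */
    char *work = shmem_malloc(1 << 20);
    if (work == NULL)
        fprintf(stderr, "PE %d: symmetric heap exhausted\n", shmem_my_pe());

    /* ... use work as the source or target of SHMEM transfers ... */

    shmem_free(work);
    shmem_finalize();
    return 0;
}
```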
A percentage (controlled by the SHMEM_FREEMEM_THRESHOLD environment
variable) of the current_freemem value is displayed because the value
at the time of SHMEM initialization does not reflect future growth of
the heap or stack during program execution, and SHMEM has no way to
determine that growth. When initialization adds up the sizes of the
four SHMEM memory regions to determine whether the program will
oversubscribe memory, allowing allocation of 100% of the currently
available memory would very likely lead to running out of memory
later during execution.
Instead, SHMEM allows allocation of a percentage of memory using the
SHMEM_FREEMEM_THRESHOLD environment variable. On subsequent job
launches, you can increase or decrease this value based on your
knowledge of the program and experience running it. Because each node
allocated to a job runs its own instance of the operating system,
because each node may have a different amount of physical memory, and
because Linux memory management is highly dynamic and not strictly
deterministic, the amount of available memory can vary from node to
node, usually slightly but sometimes greatly. Given that the SHMEM
programming model requires the size of the SHMEM regions to be the
same for each PE, a variation in available memory from node to node
means that the minimum across all nodes is essentially all that is
available per node for all PEs.
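As a rough illustration of the arithmetic, using the minimum values from the example display above; the library's internal accounting may differ in detail, and the 95% threshold here is a hypothetical SHMEM_FREEMEM_THRESHOLD setting.

```c
/* Illustrative per-PE budget computation; the memory and PE counts are
 * taken from the example display, and the threshold is hypothetical. */
#include <stdio.h>

int main(void)
{
    long min_huge_page_freemem = 6416; /* MiB, minimum across nodes */
    int  pes_per_node          = 24;
    int  threshold_percent     = 95;

    long per_node_budget = min_huge_page_freemem * threshold_percent / 100;
    long per_pe_budget   = per_node_budget / pes_per_node;

    printf("per-node budget %ldM, per-PE budget %ldM\n",
           per_node_budget, per_pe_budget);
    return 0;
}
```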
The lower section of the display shows the virtual address ranges
for the SHMEM memory regions, along with the text segment and the
auxiliary symmetric regions. Addresses are in some cases rounded to
meet alignment requirements. A small diagnostic sketch follows the
region descriptions below.
text The text segment is not, strictly speaking, a SHMEM memory
region, but is displayed here because it is an important
piece of the memory allocation picture. This includes
executable text and read-only data.
data The initialized read/write data area.
bss The uninitialized read/write data area. Taken together, the
data and bss regions comprise the SHMEM data segment.
privheap The private heap is the region of memory used primarily for
data objects allocated with calls to malloc(). The private
heap can grow as more memory is allocated. The value
displayed by SHMEM is the value at the time that SHMEM
initialization is complete, so it does not reflect any
growth of the heap later in the job. If the application
mallocs a significant amount of memory, this should be taken
into consideration when looking at current_freemem and
huge_page_freemem in the SHMEM display. SHMEM initialization
cannot know how much the private heap will grow.
symheap The symmetric heap is the region of memory SHMEM has
registered with the network for data transfers of objects on
the symmetric heap. Data objects on the symmetric heap are
allocated for use by the program with calls to shmalloc() or
shpalloc(). This is the only valid way to allocate objects
from the symmetric heap. Use the XT_SYMMETRIC_HEAP_SIZE
environment variable to control the size of this region.
alltoall The region of symmetric memory used for the shmem_alltoall
routines. This is not part of the symmetric heap specified
by XT_SYMMETRIC_HEAP_SIZE. See the
SHMEM_ALLTOALL_SYMBUF_SIZE environment variable.
team The region of symmetric memory used for the SHMEM team
routines. This is not part of the symmetric heap specified
by XT_SYMMETRIC_HEAP_SIZE.
stack The SHMEM stack is the region of memory used for data
objects allocated on the stack. The stack can grow as
routines are entered and stack space is needed. The value
displayed by SHMEM is the value at the time that SHMEM
initialization is complete, so it does not reflect any growth
of the stack later in the job. If the application uses a
significant amount of stack space, this should be taken into
consideration when looking at current_freemem and
huge_page_freemem in the SHMEM display. SHMEM initialization
cannot know how much the stack will grow.
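The following small diagnostic sketch, assuming the standard OpenSHMEM C interface, prints one address from each region on PE 0; the printed values can be compared against the ranges in the display.

```c
#include <shmem.h>
#include <stdio.h>
#include <stdlib.h>

int initialized_global = 1;   /* data */
int zeroed_global;            /* bss  */

int main(void)
{
    shmem_init();
    int  on_stack;                                 /* stack */
    int *on_privheap = malloc(sizeof(int));        /* private heap */
    int *on_symheap  = shmem_malloc(sizeof(int));  /* symmetric heap */

    if (shmem_my_pe() == 0) {
        printf("data     %p\n", (void *)&initialized_global);
        printf("bss      %p\n", (void *)&zeroed_global);
        printf("privheap %p\n", (void *)on_privheap);
        printf("symheap  %p\n", (void *)on_symheap);
        printf("stack    %p\n", (void *)&on_stack);
    }

    shmem_free(on_symheap);
    free(on_privheap);
    shmem_finalize();
    return 0;
}
```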
The first --total line gives the sum of the four SHMEM memory regions
plus the text segment. It does not necessarily include all memory the
program uses during execution, because the program may later grow
parts of the stack or heap.
The size given for the OS is an estimate based on information provided
by /proc/boot_freemem on CLE 3.0 or later systems. This size
essentially represents all physical memory on the node that is not
directly available to the running program.
The second --total line gives the sum of all allocated memory on the
node at the time of SHMEM initialization. The purpose is to give a
rough idea of how much of the node's memory is being used and how much
more could be potentially used if needed.
The SHMEM memory regions are allocated for every PE. If there is more
than one active PE per node, the amount of memory allocated per node
is the per-PE value times the number of active PEs per node, so the
display shows sizes on both a per-process and a per-node basis. The
memory allocated to the OS is reported only on a per-node basis.
Overcommitment of Memory
Because Cray XE systems do not have swap space that would allow
overcommitment of physical memory, SHMEM initialization attempts to
detect overcommitment. A process cannot request a total amount of
memory for the combined data, private heap, symmetric heap, and stack
segments in excess of the available free memory on the node divided by
the number of active processes on the node.
You will most likely want to use as much of the physical memory on the
node as possible for the program's statically and dynamically
allocated data. If the total of all of the memory regions per PE times
the number of active PEs per node exceeds the available physical
memory, a message like the following is displayed:
LIBSMA ERROR:
The total requested size for the data segment, stack,
SHMEM symmetric heap, and private heap per PE of 1500M,
times the number of PEs per node of 24 is 36015M. This
exceeds 27135M, which is 95% of the available memory that
is in blocks large enough to support a page size of 2048K.
Try per PE values for
datasegment + privheap + XT_SYMMETRIC_HEAP_SIZE + stack
that totals 1130M or less.
Or reduce the number of PEs per node.
Or try a smaller huge page size.
The sizes recommended in this message are guidelines, not guarantees,
but are likely to be safe. You must match the memory demands of the
program with the physical memory of the node and the sizes of the
SHMEM memory regions.
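For example, here is one hypothetical way to work back from the message above to a safe symmetric heap size. Only the 1130M per-PE budget comes from the message; the other region sizes are illustrative estimates, not measured values.

```c
/* Illustrative arithmetic: subtract estimates of the other regions
 * from the per-PE budget reported in the LIBSMA ERROR message. */
#include <stdio.h>

int main(void)
{
    long per_pe_budget = 1130; /* MiB, from the message above */
    long data_segment  = 101;  /* data + bss, hypothetical */
    long priv_heap     = 64;   /* hypothetical, allow for later growth */
    long stack         = 16;   /* hypothetical, allow for later growth */

    long sym_heap = per_pe_budget - data_segment - priv_heap - stack;
    printf("try XT_SYMMETRIC_HEAP_SIZE=%ldM or less\n", sym_heap);
    return 0;
}
```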
Out-of-Range Address Arguments
Data objects that are used as arguments to SHMEM routines must lie
entirely within the SHMEM memory regions. If this is not the case, a
message like this one is displayed:
LIBSMA ERROR: PE 0: put target 0x007fffff7fbb50 lies neither in data
segment nor symmetric heap
remote dataseg [0x000000005bc000 .. 0x0000000063d000] - PE 0
remote symheap [0x002aaaab210000 .. 0x002aaaac311000] - PE 0
In this example the operation failed because the target of a put
operation must be a remote object, and the address is clearly not in
the range of either remote memory region.
If you need more information to diagnose and resolve the problem, set
the SHMEM_MEMINFO_DISPLAY environment variable to display information
about how your job's memory is allocated. For example, doing so would
make it clear that the address in the error message shown above is for
an object on the stack, which is not a valid target for a put
operation.
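Here is a minimal sketch of the failure pattern and one fix, assuming the standard OpenSHMEM C interface; the commented-out call is the kind that produces the error above.

```c
#include <shmem.h>

int main(void)
{
    shmem_init();
    int peer = (shmem_my_pe() + 1) % shmem_n_pes();

    long bad_target;                  /* stack: not valid as a remote target */
    (void)bad_target;
    /* shmem_long_p(&bad_target, 1, peer);   would produce the error above */

    static long good_target;          /* data segment: valid remote target */
    shmem_long_p(&good_target, 1, peer);

    shmem_barrier_all();
    shmem_finalize();
    return 0;
}
```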