This is where the actual kernel is launched. All of the launch parameters come from the integ_pair value that is passed in. For more information, see choose and launch_templatized_integrator. The first step is to calculate the number of systems that fit into a block.
Although system_per_block can theoretically be set to 1, for performance reasons it is better to set it equal to the number of memory banks in the device. This value is directly related to SHMEM_CHUNK_SIZE and ENSEMBLE_CHUNK_SIZE; for optimal performance, all three constants should equal the number of memory banks in the device.
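The calculation described above can be sketched as follows. This is a minimal illustration, not the code used here: the function systems_per_block, its parameters, and the constant kBankCount are hypothetical names, and the sketch assumes the block is limited only by its thread count and its shared memory, with the result rounded down to a multiple of the chunk size.

```cpp
#include <cassert>

// Assumed number of memory banks (32 on compute capability 2.x); hypothetical name.
constexpr int kBankCount = 32;

// Largest multiple of `chunk` systems whose threads and shared memory
// both fit in one block. Returns 0 if not even one full chunk fits.
int systems_per_block(int threads_per_system, int shmem_per_system,
                      int max_threads_per_block, int max_shmem_per_block,
                      int chunk = kBankCount) {
    assert(threads_per_system > 0 && shmem_per_system > 0 && chunk > 0);
    // Upper bound imposed by the thread limit of a block.
    int by_threads = max_threads_per_block / threads_per_system;
    // Upper bound imposed by the shared memory available to a block.
    int by_shmem = max_shmem_per_block / shmem_per_system;
    int fit = by_threads < by_shmem ? by_threads : by_shmem;
    // Round down so that the block holds a whole number of chunks.
    return (fit / chunk) * chunk;
}
```

For example, with 9 threads and 432 bytes of shared memory per system, a 1024-thread block with 48 KB of shared memory fits 113 systems, which rounds down to 96. A return value of 0 signals that the chunk size would have to be lowered, at the cost discussed in the analysis.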
Theoretical analysis: Load instructions are executed per warp. On the Fermi architecture (compute capability 2.x), a warp consists of 32 threads with consecutive IDs, so a load instruction in a warp triggers 32 loads sent to the memory controller. The memory controller has a number of banks that can handle loads independently. According to the CUDA C Programming Guide, on devices of compute capability 2.x there are no memory bank conflicts when consecutive threads access consecutive elements of an array of 64-bit values.
It is therefore essential to set system_per_block, SHMEM_CHUNK_SIZE and ENSEMBLE_CHUNK_SIZE to the warp size for optimal performance. Lower values that are powers of 2 will result in some bank conflicts; the lower the value, the higher the chance of a bank conflict.
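The effect of the chunk size on bank conflicts can be explored with a small host-side model. This is a rough sketch under stated assumptions, not a description of the actual memory layout: it assumes one thread per system, one bank-width word per element, 32 banks, and a chunked layout in which the same variable of `chunk` neighbouring systems is stored contiguously. The names chunked_index and conflict_degree are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // threads per warp
constexpr int kBanks = 32;     // assumed number of banks

// Word index of variable `var` of system `sys` in the assumed chunked layout.
int chunked_index(int sys, int var, int vars_per_system, int chunk) {
    return (sys / chunk) * (vars_per_system * chunk) + var * chunk + sys % chunk;
}

// Largest number of lanes in one warp that map to the same bank when
// lane i reads variable `var` of system i (1 means conflict-free).
int conflict_degree(int var, int vars_per_system, int chunk) {
    assert(chunk > 0 && var >= 0 && var < vars_per_system);
    std::array<int, kBanks> hits{};  // zero-initialised per-bank counters
    int worst = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        int bank = chunked_index(lane, var, vars_per_system, chunk) % kBanks;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

In this model a chunk size of 32 is conflict-free for any number of variables per system. A chunk size of 16 gives a 2-way conflict with 2 variables per system but none with 3, and a chunk size of 8 with 4 variables per system gives a 4-way conflict, which matches the observation that smaller chunk sizes make conflicts more likely rather than certain.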