Parallel programming technologies on hybrid architectures (presentation)

October 26, 2021



Slide 2

Goal: Efficient parallelization of complex numerical problems in computational physics

HETEROGENEOUS COMPUTATIONS TEAM, HybriLIT

Plan of the talk:
I. Efficient parallelization of complex numerical problems in computational physics
• Introduction
• Hardware and software
• Heat transfer problem
II. GIMM FPEIP package and MCTDHB package
III. Summary and conclusion

Slide 3

…
TOP500 List – June 2014

Slide 4

TOP500 List – June 2014
Source: http://www.top500.org/blog/slides-for-the-43rd-top500-list-now-available/

Slide 5

TOP500 List – June 2014
Source: http://www.top500.org/blog/slides-for-the-43rd-top500-list-now-available/

Slide 6

"Lomonosov" Supercomputer, MSU
• >5000 computation nodes
• Intel Xeon X5670/X5570/E5630, PowerXCell 8i
• ~36 GB DRAM
• 2 x NVIDIA Tesla X2070, 6 GB GDDR5 (448 CUDA cores)
• InfiniBand QDR

Slide 7

NVIDIA Tesla K40 “Atlas” GPU Accelerator

Custom languages such as CUDA and OpenCL

Specifications
• 2880 CUDA GPU cores
• Peak floating-point performance: 4.29 TFLOPS single precision, 1.43 TFLOPS double precision
• Memory: 12 GB GDDR5, memory bandwidth up to 288 GB/s

Supports Dynamic Parallelism and Hyper-Q features
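The quoted peak figures can be cross-checked with a rough estimate of the form (number of units) x (2 floating-point operations per cycle for a fused multiply-add) x (clock frequency). The 745 MHz base clock and the count of 960 double-precision units are assumptions taken from NVIDIA's published K40 specifications; neither appears on the slide.

```latex
\begin{align*}
P_{\text{SP}} &\approx 2880 \times 2 \times 0.745\ \text{GHz} \approx 4.29\ \text{TFLOPS},\\
P_{\text{DP}} &\approx 960 \times 2 \times 0.745\ \text{GHz} \approx 1.43\ \text{TFLOPS}.
\end{align*}
```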

Slide 8

"Tornado SUSU" Supercomputer, South Ural State University, Russia
• 480 computing units (compact and powerful computing blade modules)
• 960 Intel Xeon X5680 processors (Gulftown, 6 cores at 3.33 GHz)
• 384 Intel Xeon Phi SE10X coprocessors (61 cores at 1.1 GHz)

The "Tornado SUSU" supercomputer ranked 157th in the 43rd edition of the TOP500 list (June 2014).

Slide 9

Intel® Xeon Phi™ Coprocessor

At the end of 2012, Intel launched the first generation of the Intel Xeon Phi product family.

Intel Xeon Phi 7120P
Clock speed: 1.24 GHz
L2 cache: 30.5 MB
TDP: 300 W
Cores: 61
Threads: 244

Intel Many Integrated Core Architecture (Intel MIC) is a multiprocessor computer architecture developed by Intel.

Each core supports 4 threads in hardware.

Slide 10

HybriLIT: heterogeneous computation cluster
(Lomonosov supercomputer, MSU)
CICC comprises 2582 cores
Disk storage capacity: 1800 TB

August 2014

Site: http://hybrilit.jinr.ru

Slide 11

HybriLIT: heterogeneous computation cluster

Node configurations:
• 2x Intel Xeon CPU E5-2695v2, 3x NVIDIA Tesla K40S
• 2x Intel Xeon CPU E5-2695v2, NVIDIA Tesla K20X, Intel Xeon Phi Coprocessor 5110P
• 2x Intel Xeon CPU E5-2695v2, 2x Intel Xeon Phi Coprocessor 7120P

Slide 12

What we see: modern supercomputers are hybrid, with heterogeneous nodes

• Multiple CPU cores with shared memory, multiple GPUs
• Multiple CPU cores with shared memory, multiple coprocessors
• Multiple CPUs, GPU, coprocessor

Slide 13

Parallel technologies: levels of parallelism

In the last decade, novel computational technologies and facilities have become available: MP-CUDA-Accelerators?...

How to control hybrid hardware: MPI – OpenMP – CUDA – OpenCL ...

[Figure labels: node 1, node 2]

Slide 14

In the last decade, novel computational facilities and technologies have become available: MPI-OpenMP-CUDA-OpenCL...
It is not easy to follow modern trends. Should the existing codes be modified, or new ones developed?

Slide 15

Problem HCE: heat conduction equation
Initial boundary value problem for the heat conduction equation:

D – rectangular domain with boundary Γ:

Slide 16

Problem HCE: computation scheme
Locally one-dimensional scheme: reduction of a multidimensional problem to a chain of one-dimensional problems

Let:

Difference scheme:
Explicit, implicit, … ?

Slide 17

Problem HCE: computation scheme

Step 1: (Ny-2) systems of difference equations in the x direction
Step 2: (Nx-2) systems of difference equations in the y direction

under the additional conjugation conditions, boundary conditions, and normalization condition
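If the one-dimensional steps are discretized implicitly, each grid line typically yields a tridiagonal system, which can be solved with the Thomas (sweep) algorithm. The slides do not show the solver itself, so the following is only a minimal sketch under that assumption; the name `thomas_solve` and its argument layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): solves one tridiagonal system
 *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1,
 * where a[0] and c[n-1] do not enter the equations. The arrays c and d
 * are used as scratch space; on return d holds the solution x.
 * No pivoting is done, so the matrix is assumed diagonally dominant. */
static void thomas_solve(size_t n, const double *a, const double *b,
                         double *c, double *d)
{
    assert(n >= 1);

    /* Forward sweep: eliminate the lower diagonal. */
    c[0] /= b[0];
    d[0] /= b[0];
    for (size_t i = 1; i < n; ++i) {
        double m = b[i] - a[i] * c[i - 1];
        c[i] /= m;
        d[i] = (d[i] - a[i] * d[i - 1]) / m;
    }

    /* Back substitution: d[n-1] already holds the last unknown. */
    for (size_t i = n - 1; i-- > 0; )
        d[i] -= c[i] * d[i + 1];
}
```

Each of the (Ny-2) systems of step 1 is independent of the others, and likewise the (Nx-2) systems of step 2, which is what makes both steps parallelizable.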

Slide 18

Problem HCE: parallelization scheme

[Figure: both steps of the scheme are marked "Parallel"]

Slide 19

Parallel Technologies

Slide 20

OpenMP realization of parallel algorithm

Slide 21

OpenMP (Open specifications for Multi-Processing)

OpenMP is an API that supports multi-platform shared-memory multiprocessing programming in Fortran, C, and C++.

• Compiler directives
• Environment variables, e.g. export OMP_NUM_THREADS=3
• Library routines

http://openmp.org/wp/

Slide 22

OpenMP (Open specifications for Multi-Processing)

Compiler directive
Library routines

Use the -openmp flag to compile with Intel compilers:
icc -openmp code.c -o code
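As an illustration of a compiler directive, here is a minimal sketch of a loop parallelized with OpenMP. The function `sum_of_squares` is a hypothetical example and does not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example (not from the slides): sums the squares of an array.
 * In an OpenMP-enabled build the loop iterations are divided among the
 * threads and the per-thread partial sums are combined by the reduction
 * clause. Without OpenMP the pragma is ignored and the loop runs serially. */
static double sum_of_squares(const double *v, size_t n)
{
    double sum = 0.0;
    assert(v != NULL || n == 0);

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += v[i] * v[i];

    return sum;
}
```

The number of threads can be set through the OMP_NUM_THREADS environment variable. With GCC the corresponding compiler flag is -fopenmp.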

Slide 23

OpenMP realization: multiple CPU cores that share memory

Table 2. OpenMP realization of problem 1: execution time and acceleration (Xeon CPU, K100, KIAM RAS)
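The table values are not preserved in this transcript. Assuming that "acceleration" means the usual speedup T1 / Tp (the slides do not spell out the definition), speedup and parallel efficiency could be computed as in the sketch below; both function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: speedup T1 / Tp, where T1 is the serial execution
 * time and Tp is the execution time on p threads. */
static double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Hypothetical helper: parallel efficiency, i.e. speedup per thread. */
static double efficiency(double t_serial, double t_parallel, int n_threads)
{
    assert(n_threads > 0);
    return speedup(t_serial, t_parallel) / (double)n_threads;
}
```

For example, with made-up timings of 8 s serial and 2 s on 8 threads, the speedup is 4 and the efficiency is 0.5.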

Slide 24

OpenMP realization: Intel® Xeon Phi™ Coprocessor

Compiling:
icc -openmp -O3 -vec-report=3 -mmic algLocal_openmp.cc -o alg_openmp_xphi

Table 3. OpenMP realization: execution time and acceleration (Intel Xeon Phi, LIT).

Slide 25

OpenMP realization: Intel® Xeon Phi™ Coprocessor

Optimizations
The KMP_AFFINITY environment variable: the Intel® OpenMP* runtime library can bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable.

Affinity types illustrated: compact, scatter

Source: https://software.intel.com/

Slide 26

CUDA (Compute Unified Device Architecture)
programming model, CUDA C

Slide 27

CUDA (Compute Unified Device Architecture) programming model, CUDA C

CPU / GPU Architecture
[Figure: a CPU with 4 cores (Core 1 to Core 4) next to a GPU with 15 multiprocessors of 192 cores each; 2880 CUDA GPU cores in total]
Source: http://blog.goldenhelix.com/?p=374

HETEROGENEOUS COMPUTATIONS GROUP, HybriLIT

Slide 28

CUDA (Compute Unified Device Architecture) programming model
Source: http://www.realworldtech.com/includes/images/articles/g100-2.gif

Slide 29

Device Memory Hierarchy

• Registers and local memory: registers are fast; off-chip local memory has high latency
• Shared memory: tens of KB per block, on-chip, very fast
• Global memory: size up to 12 GB, high latency. Random access is very expensive! Coalesced access is much more efficient

Source: CUDA C Programming Guide (February 2014)

Slide 30

Function Type Qualifiers

__host__: called from the CPU, runs on the CPU
__global__: kernel, launched from the CPU, runs on the GPU
__device__: called from the GPU, runs on the GPU

__global__ void kernel ( void ){
}
int main(){
    …
    kernel <<< gridDim, blockDim >>> ( args );
    …
}

dim3 gridDim – dimension of the grid,
dim3 blockDim – dimension of the blocks

Language extensions:
Kernel execution directive

Slide 31

Threads and blocks

int tid = threadIdx.x + blockIdx.x * blockDim.x;

tid – global index of the thread
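The index formula can be tried out on the CPU. The sketch below is a plain C emulation for illustration only: in a real kernel threadIdx.x, blockIdx.x, and blockDim.x are built-in variables, whereas here they are ordinary function arguments. Both function names are hypothetical, and the rounded-up block count is a common convention rather than something stated on the slide.

```c
#include <assert.h>

/* CPU-side emulation of tid = threadIdx.x + blockIdx.x * blockDim.x. */
static int global_thread_index(int thread_idx, int block_idx, int block_dim)
{
    assert(block_dim > 0 && thread_idx >= 0 && thread_idx < block_dim);
    assert(block_idx >= 0);
    return thread_idx + block_idx * block_dim;
}

/* Number of blocks needed so that blocks * block_dim >= n (rounded up). */
static int blocks_for(int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}
```

When the number of elements is not a multiple of the block size, the last block contains threads whose index is past the end of the data, so a kernel usually guards its body with a check such as tid < n.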

Slide 32

Program structure in CUDA C/C++ and in C/C++

Slide 33

Compilation

nvcc -arch=compute_35 test_CUDA_deviceInfo.cu -o test_CUDA -o deviceInfo

Compilation tools are a part of the CUDA SDK
NVIDIA CUDA Compiler Driver NVCC
Full information: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/

Slide 34

Some GPU-accelerated Libraries
Source: https://developer.nvidia.com/cuda-education (Will Ramey, NVIDIA Corporation)

Slide 35

Problem HCE: parallelization scheme

[Figure: both steps of the scheme are marked "Parallel"]

Slide 36

Problem HCE: CUDA realization

Initialization: the parameters of the problem and of the computational scheme are copied into GPU constant memory.
Initialization of descriptors: cuSPARSE functions

Calculation of the lower, main, and upper diagonals and the right-hand side of the systems of linear algebraic equations (SLAEs) (1):
Kernel_Elements_System_1 <<< … >>>()

Parallel solution of (Ny-2) SLAEs in the x direction using
cusparseDgtsvStridedBatch()

Calculation of the lower, main, and upper diagonals and the right-hand side of SLAEs (2):
Kernel_Elements_System_2 <<< … >>>()

Parallel solution of (Nx-2) SLAEs in the y direction using
cusparseDgtsvStridedBatch()
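To show what a strided batch of tridiagonal systems looks like in memory, here is a CPU sketch in plain C. It is not the cuSPARSE implementation: the function name `tridiag_strided_batch` is hypothetical, the solver is a simple Thomas sweep without pivoting, and the layout (system k stored at offset k * stride in each array) is only modelled on the idea of a batch stride.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU sketch of a strided batch of tridiagonal solves.
 * System k occupies elements [k*stride, k*stride + m) of dl, d, du and x:
 *   dl = lower diagonal (first element of each system unused),
 *   d  = main diagonal,
 *   du = upper diagonal (last element of each system unused),
 *   x  = right-hand side on entry, solution on return.
 * du is overwritten as scratch space. No pivoting: the systems are
 * assumed diagonally dominant. */
static void tridiag_strided_batch(size_t m, const double *dl, const double *d,
                                  double *du, double *x,
                                  size_t batch_count, size_t stride)
{
    assert(m >= 1 && stride >= m);

    /* The systems are independent, so the loop over k can run in parallel. */
    #pragma omp parallel for
    for (long k = 0; k < (long)batch_count; ++k) {
        const double *a = dl + (size_t)k * stride;
        const double *b = d  + (size_t)k * stride;
        double *c = du + (size_t)k * stride;
        double *r = x  + (size_t)k * stride;

        /* Forward sweep. */
        c[0] /= b[0];
        r[0] /= b[0];
        for (size_t i = 1; i < m; ++i) {
            double piv = b[i] - a[i] * c[i - 1];
            c[i] /= piv;
            r[i] = (r[i] - a[i] * r[i - 1]) / piv;
        }
        /* Back substitution. */
        for (size_t i = m - 1; i-- > 0; )
            r[i] -= c[i] * r[i + 1];
    }
}
```

On the GPU the same independence between systems is what allows all (Ny-2) or (Nx-2) systems of one step to be solved in a single batched call.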

Slide 37

CUDA realization of parallel algorithm: efficiency of parallelization

Table 1. CUDA realization: execution time and acceleration

Slide 38

Problem HCE : analysis of results

Slide 39

Hybrid Programming: MPI+CUDA, on the Example of the GIMM FPEIP Complex

GIMM FPEIP: package developed for simulation of thermal processes in materials irradiated by heavy ion beams

Alexandrov E.I., Amirkhanov I.V., Zemlyanaya E.V., Zrelov P.V., Zuev M.I., Ivanov V.V., Podgainy D.V., Sarker N.R., Sarkhadov I.S., Streltsova O.I., Tukhliev Z.K., Sharipov Z.A. (LIT)
Principles of Software Construction for Simulation of Physical Processes on Hybrid Computing Systems (on the Example of GIMM_FPEIP Complex) // Bulletin of Peoples' Friendship University of Russia. Series "Mathematics. Information Sciences. Physics". 2014. No. 2. Pp. 197-205.

Slide 40

GIMM FPEIP: package for simulation of thermal processes in materials irradiated by heavy ion beams

Purpose: to solve a system of coupled heat conduction equations that form the basis of the thermal spike model in a cylindrical coordinate system

Multi-GPU

Slide 41

GIMM FPEIP: Logical scheme of the complex

Slide 42

Using Multi-GPUs

Slide 43

MPI, MPI+CUDA (CICC LIT, K100 KIAM)

Slide 44

Hybrid Programming: MPI+OpenMP, MPI+OpenMP+CUDA

The MultiConfigurational Time-Dependent Hartree (for) Bosons (MCTDHB) method:
PRL 99, 030402 (2007), PRA 77, 033613 (2008)
It solves the TDSE numerically exactly; for benchmarking see PRA 86, 063606 (2012)

MCTDHB founders:
Lorenz S. Cederbaum,
Ofir E. Alon,
Alexej I. Streltsov

Since 2013, cooperation with LIT: development of new hybrid implementations of the package

Ideas, methods, and parallel implementation of the MCTDHB package:
Many-body theory of bosons group in Heidelberg, Germany
http://MCTDHB.org

Slide 45

The Time-Dependent Schrödinger equation governs the physics of trapped ultra-cold atomic clouds

To solve the Time-Dependent Many-Boson Schrödinger Equation we apply the MultiConfigurational Time-Dependent Hartree (for) Bosons method:
PRL 99, 030402 (2007), PRA 77, 033613 (2008)
It solves the TDSE numerically exactly; for benchmarking see PRA 86, 063606 (2012)

One has to specify an initial condition
and propagate Ψ(x,t) → Ψ(x,t+Δt)

Slide 46

All the terms of the Hamiltonian are under experimental control and can be manipulated

1D-2D-3D: control of dimensionality by changing the aspect ratio of the trap

BECs of alkali, alkaline-earth, and lanthanoid atoms (7Li, 23Na, 39K, 41K, 85Rb, 87Rb, 133Cs, 52Cr, 40Ca, 84Sr, 86Sr, 88Sr, 174Yb, 164Dy, and 168Er)

The interatomic interaction can be widely varied with a magnetic Feshbach resonance… (Greiner Lab at Harvard)

Magneto-optical trap

Slide 47

Two generic regimes: (i) non-violent (under-a-barrier) and (ii) explosive (over-a-barrier)

Dynamics, N=100: sudden displacement of the trap and sudden quenches of the repulsion in 2D, arXiv:1312.6174

Slide 48

List of Applications

The modern development of computer technologies (multi-core processors, GPUs, coprocessors, and others) requires new approaches and technologies for parallel programming.
Effective use of high-performance computing systems accelerates research, engineering development, and the creation of specific devices.

Conclusion
