26.10.2018

Derevyannie Fermi Chertezhi3940891

This article includes a, but its sources remain unclear because it has insufficient. Please help to this article by more precise citations. ( August 2014) () Nvidia Fermi Release date April 2010 Transistors 40 nm and 28 nm History Predecessor Successor Fermi is the codename for a GPU developed by, first released to retail in April 2010, as the successor to the microarchitecture.

Apr 16, 2015  iopscience.iop.org.

It was the primary microarchitecture used in the. It was followed by, and used alongside Kepler in the,, and, in the latter two only in mobile GPUs. In the workstation market, Fermi found use in the x000 series, Quadro NVS models, as well as in computing modules. All desktop Fermi GPUs were manufactured in 40 nm, mobile Fermi GPUs in 40 nm and 28 nm.

Fermi is the oldest microarchitecture from NVIDIA that received support for the Microsoft's rendering API Direct3D 12 feature_level 11. The architecture is named after, an Italian physicist. NVIDIA Fermi architecture Convention in figures: orange - scheduling and dispatch; green - execution; light blue -registers and caches. Fermi Graphic Processing Units () feature 3.0 billion transistors and a schematic is sketched in Fig. • Streaming Multiprocessor (SM): composed of 32 cores (see Streaming Multiprocessor and CUDA core sections). • GigaThread global scheduler: distributes thread blocks to SM thread schedulers and manages the context switches between threads during execution (see Warp Scheduling section).

• Host interface: connects the GPU to the CPU via a PCI-Express v2 bus (peak transfer rate of 8GB/s). • DRAM: supported up to 6GB of GDDR5 DRAM memory thanks to the 64-bit addressing capability (see Memory Architecture section).

• Clock frequency: 1.5 GHz (not released by NVIDIA, but estimated by Insight 64). • Peak performance: 1.5 TFlops. • Global memory clock: 2 GHz. • DRAM: 192GB/s. Streaming multiprocessor [ ] Each SM features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units (SFUs), a 64KB block of high speed on-chip memory (see L1+Shared Memory subsection) and an interface to the L2 cache (see L2 Cache subsection). Load/Store Units: Allow source and destination addresses to be calculated for 16 threads per clock. Load and store the data from/to.

Special Functions Units (SFUs): Execute transcendental instructions such as sin, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied. CUDA core [ ] Integer Arithmetic Logic Unit (ALU): Supports full 32-bit precision for all instructions, consistent with standard programming language requirements. It is also optimized to efficiently support 64-bit and extended precision operations. Floating Point Unit (FPU): Implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction (see Fused Multiply-Add subsection) for both single and double precision arithmetic.

Derevyannie

Up to 16 double precision fused multiply-add operations can be performed per SM, per clock. Polymorph-Engine [ ] Fused Multiply-Add [ ] Fused Multiply-Add (FMA) perform multiplication and addition (i.e., A*B+C) with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. Warp scheduling [ ] The Fermi architecture uses a two-level, distributed scheduler.

Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic Fig. For example, the SM can mix 16 operations from the 16 first column cores with 16 operations from the 16 second column cores, or 16 operations from the load/store units with four from SFUs, or any other combinations the program specifies. Note that 64-bit operations consumes both the first two execution columns. Razgrafka zon sk 63 1. This implies that an SM can issue up to 32 single-precision (32-bit) floating point operations or 16 double-precision (64-bit) floating point operations at a time. GigaThread Engine: The GigaThread engine schedules thread blocks to various SMs Dual Warp Scheduler: At the SM level, each warp scheduler distributes warps of 32 threads to its execution units.

Threads are scheduled in groups of 32 threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. The dual warp scheduler selects two warps, and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or 4 SFUs. Most instructions can be dual issued; two integer instructions, two floating instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently. Double precision instructions do not support dual dispatch with any other operation. [ ] Performance [ ] The theoretical single-precision processing power of a Fermi GPU in is computed as 2 (operations per FMA instruction per CUDA core per cycle) × number of CUDA cores × shader clock speed (in GHz). Note that the previous generation could dual-issue MAD+MUL to CUDA cores and SFUs in parallel, but Fermi lost this ability as it can only issue 32 instructions per cycle per SM which keeps just its 32 CUDA cores fully utilized.