Question 1

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

AImplement redundant switches with spanning tree protocol.

BMLAG for bonded interfaces across redundant switches.

CUse only one switch for all management and storage traffic.

DDisable VLANs and use unmanaged switches.

Answer : B

For the 'North-South' and 'Management/Storage' Ethernet fabrics in an NVIDIA AI Factory, high availability is paramount. Unlike the InfiniBand compute fabric, which uses its own routing logic, the Ethernet side relies on standard data center protocols. To provide true hardware redundancy and double the available bandwidth (Load Balancing), NVIDIA recommends MLAG (Multi-Chassis Link Aggregation). MLAG allows two physical switches to appear as a single logical unit to the DGX nodes. The DGX can then bond its two Ethernet NICs (e.g., in an 802.3ad LACP bond) and connect one cable to each switch. This configuration provides several benefits: if one switch fails, the traffic seamlessly stays on the other link without the slow convergence times associated with Spanning Tree Protocol (Option A). Furthermore, it allows the cluster to utilize the combined bandwidth of both links for heavy storage traffic (like NFS or S3 ingestion). Using a single switch (Option C) or unmanaged hardware (Option D) creates single points of failure and lacks the traffic isolation (VLANs) required for secure AI infrastructure.

Question 2

An engineer needs to verify NVLink isolation on a single node with 8 GPUs. Which NCCL test configuration stresses switch bisection bandwidth?

AUse NCCL_TESTS_SPLIT='DIV 8' with point-to-point tests

BUse all_reduce_perf -b 8 -e 16G -f 2 -g 8 with NCCL_TESTS_SPLIT='AND 0x1'

CUse reduce_scatter_perf -b 8 -e 16G -f 2 -g 4

DUse all_reduce_perf -b 8 -e 16G -f 2 -g 8 without splits

Answer : B

To validate the robustness of the NVLink Switch fabric in a DGX H100, engineers must test how the switches handle traffic when the cluster is logically partitioned. While a standard all_reduce_perf test (Option D) shows aggregate throughput, it may not reveal issues with specific internal switch paths. Using the NCCL_TESTS_SPLIT environment variable allows for more granular stress testing. Specifically, using a bitwise mask like 'AND 0x1' (Option B) creates specific traffic subsets that force data through the internal NVLink switch bisection. This ensures that even when only half the GPUs are communicating---or when specific patterns are used---the switches can maintain full wire speed without internal contention. This is a critical validation step during the 'Bring-up' phase to ensure there are no manufacturing defects in the NVSwitch baseboard or the high-speed traces connecting the GPU modules.

Question 3

A system administrator is installing a GPU into a server and needs to avoid damaging the device. What item should be used?

AAnti-ESD strap

BGloves

CProtective film

DElectric screwdriver

Answer : A

High-performance NVIDIA GPUs, such as the H100 or A100, are highly sensitive to Electrostatic Discharge (ESD). A static spark that a human cannot even feel (less than 3,000 volts) is enough to permanently damage the microscopic circuits within the GPU die or the HBM (High Bandwidth Memory) modules. An Anti-ESD strap (or wrist strap) is the mandatory safety item for any technician handling internal server components. It works by grounding the technician, ensuring that any static charge built up on their body is safely dissipated before they touch the hardware. While gloves (Option B) might protect against sharp edges, they do not prevent ESD unless they are specifically rated as ESD-safe. Using an electric screwdriver (Option D) is generally discouraged for sensitive components to prevent over-tightening or mechanical stress. Therefore, an ESD strap is the single most critical tool for preventing 'Infant Mortality' of expensive AI hardware during physical installation.

Question 4

A system administrator needs to install a container toolkit and successfully run the following commands:

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime docker

What step should be taken next to finish the installation?

Adpkg -i doca-host-repo-ubuntu<version>_amd64.deb

Bapt-get install cuda-drivers

Csystemctl restart docker

Dapt-get remove nvidia-container-toolkit

Answer : C

The nvidia-ctk runtime configure command is a crucial step that modifies the Docker daemon configuration file (/etc/docker/daemon.json) to register the nvidia runtime. However, the Docker daemon only reads this configuration file during its initialization phase. Even though the toolkit is installed and the configuration file is updated, Docker will not be able to spawn GPU-accelerated containers until the service is refreshed. Executing sudo systemctl restart docker (or the equivalent for your container engine) is the mandatory final step. This forces Docker to reload its settings and recognize the NVIDIA Container Runtime as a valid option. Without this restart, attempting to run a container with the --gpus all flag will result in an error stating that the 'nvidia' runtime is not found or is unconfigured. This is a common point of failure in automated AI infrastructure deployments where the configuration script finishes, but the service state remains stale.

Question 5

A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?

ASwitch from FP64 to FP32 precision.

BDisable GPU affinity.

CReduce test duration to 12 hours.

DVerify the matrix size is divisible by block size.

Answer : D

High-Performance Linpack (HPL) is the standard benchmark for stress-testing the computational stability and thermal endurance of an AI cluster. It solves a massive dense system of linear equations, and its mathematical configuration is highly sensitive. The HPL.dat configuration file defines the Problem Size ($N$) and the Block Size ($NB$). A fundamental requirement of the HPL algorithm is that the workload must be distributed evenly across the MPI processes and GPU threads. If the total matrix size $N$ is not an exact multiple of the block size $NB$, or if the grid dimensions ($P \times Q$) do not align with the hardware topology, the solver may encounter an 'illegal value' error or a 'residual too large' failure at the very beginning of the run. This is a configuration error, not a hardware fault. Reducing the precision (Option A) would invalidate the test, as HPL must run in FP64 to be considered a standard 'burn-in.' Verifying that $N$ is divisible by $NB$ ensures the mathematical integrity of the test while allowing the hardware to be pushed to its theoretical performance limits.

Free Practice Questions for NVIDIA NCP-AII Exam

Question 1

Question 2

Question 3

Question 4

Question 5