Pass4Future also provides interactive practice exam software for preparing effectively for the NVIDIA AI Operations (NCP-AIO) exam. You are welcome to explore the free sample NVIDIA NCP-AIO exam questions below and to try the NVIDIA NCP-AIO practice test software.
An instance of NVIDIA Fabric Manager service is running on an HGX system with KVM. A System Administrator is troubleshooting NVLink partitioning.
By default, what is the GPU polling subsystem set to?
Answer : B
Comprehensive and Detailed Explanation From Exact Extract:
In NVIDIA AI infrastructure, the NVIDIA Fabric Manager service is responsible for managing GPU fabric features such as NVLink partitioning on HGX systems. This service periodically polls the GPUs to monitor and manage NVLink states. By default, the GPU polling subsystem is set to poll every 30 seconds, balancing timely updates with system resource usage.
This polling interval allows the Fabric Manager to efficiently detect and respond to changes or issues in the NVLink fabric without excessive overhead or latency. It is a standard default setting unless specifically configured otherwise by system administrators.
This default behavior aligns with NVIDIA's system management guidelines for HGX platforms and is referenced in NVIDIA AI Operations materials concerning fabric management and troubleshooting of NVLink partitions.
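As a practical troubleshooting sketch, the Fabric Manager runs as a systemd service on HGX hosts, so its state and recent fabric events can be checked with standard systemd tooling. The snippet below (a hedged sketch, not an official diagnostic procedure) wraps those commands and is guarded so it is safe to run on hosts without systemd:

```python
"""Sketch: check the NVIDIA Fabric Manager service while troubleshooting
NVLink partitioning on an HGX host. Uses standard systemd tooling via
subprocess; guarded so it runs safely where systemd is absent."""
import shutil
import subprocess

# Default GPU polling interval of the Fabric Manager polling subsystem (seconds).
FM_POLL_INTERVAL_S = 30


def check_fabric_manager() -> None:
    """Print service state and recent log lines for the fabric manager unit."""
    if shutil.which("systemctl") is None:
        print("systemd tooling not found on this host")
        return
    subprocess.run(["systemctl", "status", "nvidia-fabricmanager",
                    "--no-pager"], check=False)
    subprocess.run(["journalctl", "-u", "nvidia-fabricmanager",
                    "--no-pager", "-n", "50"], check=False)


if __name__ == "__main__":
    check_fabric_manager()
    print(f"allow up to {FM_POLL_INTERVAL_S}s for NVLink partition "
          "state changes to be reflected")
```

Because of the 30-second default, a partition state change may take up to one polling cycle to surface in the service's view of the fabric.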
A Slurm user is experiencing a frequent issue where a Slurm job gets stuck in the "PENDING" state and is unable to progress to the "RUNNING" state.
Which Slurm command can help the user identify the reason for the job's pending status?
Answer : B
Comprehensive and Detailed Explanation From Exact Extract:
The Slurm command scontrol show job <jobid> provides detailed information about a specific job, including its current status and, crucially, the reason why a job might be pending. This command shows job details such as resource requirements, dependencies, and any issues blocking the job from running.
sinfo -R displays information about nodes and their reasons for being in various states but does not provide job-specific reasons.
sacct -j shows accounting data for jobs but typically does not explain pending causes.
squeue -u lists jobs by user but does not detail the pending reasons.
Hence, scontrol show job <jobid> is the appropriate command to diagnose why a Slurm job remains in the pending state.
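For illustration, the Reason field can also be extracted from scontrol's output programmatically. The sketch below (the job ID and sample output are illustrative) splits the space-separated Key=Value tokens that `scontrol show job <jobid>` prints; note that fields whose values contain spaces, such as Comment, would need more careful handling:

```python
def parse_scontrol_job(output: str) -> dict:
    """Parse the space-separated Key=Value pairs printed by
    `scontrol show job <jobid>` into a dictionary."""
    fields = {}
    for token in output.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields


# Illustrative sample of scontrol output for a pending job:
sample = (
    "JobId=1234 JobName=train UserId=alice(1001) "
    "JobState=PENDING Reason=Resources Dependency=(null) "
    "Partition=gpu NumNodes=2"
)
info = parse_scontrol_job(sample)
print(info["JobState"], info["Reason"])  # → PENDING Resources
```

Here Reason=Resources indicates the job is waiting for requested resources to become available; other common values include Priority and Dependency.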
An organization only needs basic network monitoring and validation tools.
Which UFM platform should they use?
Answer : B
Comprehensive and Detailed Explanation From Exact Extract:
The UFM Telemetry platform provides basic network monitoring and validation capabilities, making it suitable for organizations that require foundational insight into their network status without advanced analytics or AI-driven cybersecurity features. Other platforms such as UFM Enterprise or UFM Pro offer broader or more advanced functionalities, while UFM Cyber-AI focuses on AI-driven cybersecurity.
What is the primary purpose of assigning a provisioning role to a node in NVIDIA Base Command Manager (BCM)?
Answer : C
Comprehensive and Detailed Explanation From Exact Extract:
In NVIDIA Base Command Manager (BCM), assigning the provisioning role to a node enables that node to manage software images and perform provisioning tasks for other nodes in the cluster. This role allows automated deployment and configuration of cluster nodes, ensuring consistency and simplifying large-scale management. It is not primarily responsible for container orchestration, GPU monitoring, or storage management.
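As a sketch of how this looks in practice, the provisioning role can be assigned from BCM's cmsh shell. The session below is illustrative only; the node name is hypothetical, and prompts and exact steps may differ across BCM versions:

```
[headnode]% device use node001
[headnode->device[node001]]% roles
[headnode->device[node001]->roles]% assign provisioning
[headnode->device[node001]->roles[provisioning]]% commit
```

After the commit, node001 can serve software images to other cluster nodes during provisioning.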
You are tasked with deploying a deep learning framework container from NVIDIA NGC on a stand-alone GPU-enabled server.
What must you complete before pulling the container? (Choose two.)
Answer : A, D
Comprehensive and Detailed Explanation From Exact Extract:
Before pulling and running an NVIDIA NGC container on a stand-alone server, you must:
Install Docker and the NVIDIA Container Toolkit to enable container runtime with GPU support.
Generate an NGC API key and authenticate with the NGC container registry using docker login to pull private or public containers.
Setting up Kubernetes or manually installing deep learning frameworks is unnecessary when using these containers, as they already include the required frameworks.
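The two prerequisites can be sketched as the workflow below. It assumes Docker and the NVIDIA Container Toolkit are installed and that you have generated an NGC API key (used with `docker login nvcr.io`, where the username is the literal string `$oauthtoken` and the password is the key); the org, image name, and tag are illustrative:

```python
"""Sketch of pulling an NGC deep learning container on a stand-alone
GPU server, assuming Docker + NVIDIA Container Toolkit are installed
and `docker login nvcr.io` has already been done with an NGC API key."""
import shutil
import subprocess

NGC_REGISTRY = "nvcr.io"


def pull_command(org: str, image: str, tag: str) -> list[str]:
    """Compose the `docker pull` argv for an NGC-hosted container image."""
    return ["docker", "pull", f"{NGC_REGISTRY}/{org}/{image}:{tag}"]


def run_command(org: str, image: str, tag: str) -> list[str]:
    """Compose a GPU-enabled `docker run` argv (requires the
    NVIDIA Container Toolkit for `--gpus all`)."""
    return ["docker", "run", "--gpus", "all", "--rm", "-it",
            f"{NGC_REGISTRY}/{org}/{image}:{tag}"]


if __name__ == "__main__":
    cmd = pull_command("nvidia", "pytorch", "24.05-py3")  # illustrative tag
    print(" ".join(cmd))
    if shutil.which("docker"):  # only attempt the pull if Docker is present
        subprocess.run(cmd, check=False)
```

Once the image is pulled, `run_command` shows the shape of a GPU-enabled launch; without the NVIDIA Container Toolkit installed, the `--gpus all` flag would fail.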