Unofficial study app · 3 domains · learn + practice · scenario-style MCQ
Timed, no feedback until the end. Scored in %, pass = 70%. Mirrors the real sitting.
Instant feedback + explanation after each question. No timer. Best for learning.
Explained from each component's perspective — what it is + how it relates to its neighbors. Read the GPU model first, then the three exam domains. Then drill in Practice mode.
Big picture. A GPU is a factory floor of thousands of workers; the CPU is the boss. Flow: CPU copies data → GPU memory, launches a grid of work, the GPU spreads it across its engines, thousands of threads each finish one piece, results copy back. AI math is the same matrix multiply repeated billions of times → embarrassingly parallel → perfect for a GPU.
The same job, two views — software maps onto hardware: Grid (whole launch) ↔ GPU · Block (a team) ↔ SM · Warp (32 threads in lockstep) ↔ scheduler · Thread (one worker) ↔ CUDA/Tensor core.
“I am a Thread.” I run one data element; I live in a block; I own fast private registers; I cooperate only with my block-siblings via shared memory.
“I am a Block.” I'm a team of threads that lives entirely on one SM — that physical togetherness is why my threads share fast memory & sync. I'm blind to other blocks, so the GPU can run us in any order (this is what lets the same code scale up).
“I am an SM (Streaming Multiprocessor).” I'm a physical engine on the GPU (an H100 has ~132 of me). I run blocks, split them into warps, and feed my CUDA cores (math) and Tensor Cores (matrix math for AI). When one warp stalls on slow memory I instantly switch to another → latency hiding (why you launch many threads).
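A rough sketch of how the grid → block → warp → thread hierarchy turns into numbers, written in plain Python rather than CUDA. The function names and the sizes used (1,000,000 elements, 256 threads per block) are illustrative assumptions; only the warp size of 32 comes from the notes above.

```python
import math

WARP_SIZE = 32  # threads per warp, as stated in the notes


def launch_geometry(n_elements, threads_per_block):
    """Return (blocks, warps_per_block, total_threads) for a 1-D launch."""
    # Enough blocks to cover every element, rounding up.
    blocks = math.ceil(n_elements / threads_per_block)
    # An SM splits each block into warps of 32 threads.
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return blocks, warps_per_block, blocks * threads_per_block


def global_index(block_idx, threads_per_block, thread_idx):
    """Data element owned by one thread (mirrors blockIdx.x * blockDim.x + threadIdx.x)."""
    return block_idx * threads_per_block + thread_idx


blocks, warps, total = launch_geometry(1_000_000, 256)
print(blocks, warps, total)       # 3907 8 1000192
print(global_index(3, 256, 10))   # 778
```

The launch creates slightly more threads than elements (1,000,192 vs 1,000,000), which is why real kernels guard with a bounds check before touching data.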
Memory is often the real bottleneck, not compute: registers → shared/L1 → L2 → VRAM/HBM (slowest). HBM capacity caps model size; bandwidth caps speed.
The nesting: AI ⊃ ML ⊃ Deep Learning ⊃ GenAI/LLMs. ML learns patterns from data; DL = multi-layer neural nets; GenAI creates new content (transformers). ML styles: supervised (labeled), unsupervised (find structure in unlabeled), reinforcement (reward from actions, incl. RLHF).
“I am Training.” I learn the weights from huge data (forward → loss → backprop → update, over epochs). I'm heavy + sustained → many GPUs, big memory, fast interconnect, higher precision (FP32/BF16). Tools: NeMo, DGX, Base Command.
“I am Inference.” I use the trained model to predict. I'm lighter but latency-sensitive, and I often run at reduced precision (FP8/INT8) — Tensor Cores love this. Tools: TensorRT (optimize), Triton/NIM (serve). Heavily tested contrast.
“I am the GPU vs the CPU.” CPU = few powerful cores for sequential/branchy logic + orchestration. GPU = thousands of cores + Tensor Cores for the parallel matrix math. Parallel throughput, not clock speed.
Precision & quantization: FP32 → FP16/BF16 → INT8. Lower precision shrinks memory + speeds math; great for inference with minimal accuracy loss; training usually needs higher precision.
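The "lower precision shrinks memory" point is simple arithmetic: parameters × bytes per parameter. A small sketch, assuming a hypothetical 7-billion-parameter model (the model size and function name are illustrative, not from the notes):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1}


def weight_memory_gb(n_params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("FP32", "BF16", "INT8"):
    print(precision, weight_memory_gb(7e9, precision), "GB")
# FP32 28.0 GB
# BF16 14.0 GB
# INT8 7.0 GB
```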
NVIDIA software stack: CUDA (the foundation/API) · cuDNN (DL primitives) · NGC (containers + pretrained models) · TensorRT (inference optimizer) · Triton (inference server: multi-framework, batching) · NIM (containerized inference microservice) · NeMo (build/fine-tune LLMs & GenAI) · RAPIDS (GPU data science: cuDF/cuML/cuGraph) · AI Enterprise (supported suite) · Base Command (training platform).
Domain SDKs (match framework→use case — common exam type): Merlin=recommenders · Riva=speech AI · Clara=healthcare · Morpheus=cybersecurity · Maxine=video comms · Metropolis/DeepStream=video analytics · Isaac=robotics · cuOpt=route optimization · TAO Toolkit=transfer learning.
Big picture: a single GPU is fast, but real AI needs many GPUs working as one — without starving them. Everything here does one of three jobs: compute (DGX/HGX), connect (NVLink/InfiniBand/DPU), sustain (storage/power/cooling). Scale UP = more GPUs in one node (NVLink); scale OUT = more nodes (InfiniBand/Ethernet).
“I am GPU memory (HBM).” I must hold weights + activations + gradients + optimizer state + the data batch. Too small → smaller batch/model or shard across GPUs. Capacity + bandwidth are the top GPU-selection specs.
“I am NVLink / NVSwitch.” High-bandwidth intra-node GPU↔GPU (far faster than PCIe); NVSwitch makes it all-to-all so 8 GPUs act like one.
“I am InfiniBand / RoCE / Spectrum-X.” The inter-node fabric. InfiniBand and RoCE (RDMA over Converged Ethernet) both provide RDMA (move data between node memories, bypassing the CPU). Spectrum-X = NVIDIA's AI-tuned Ethernet. GPUDirect RDMA = NIC straight into GPU memory, no CPU bounce; GPUDirect Storage does the same for storage. I matter because gradient sync (all-reduce) happens every step: if I'm slow, every GPU waits.
“I am a DPU (BlueField).” The data center's third processor: GPU = compute, CPU = logic, DPU = infrastructure (offloads networking/storage/security from the CPU; enables isolation/multi-tenancy).
Scaling models: data parallelism = replicate model, split the batch (model must fit one GPU); tensor/pipeline parallelism = split the model itself across GPUs (when it doesn't fit).
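Data parallelism works because averaging the gradients from equal-sized batch shards gives the same result as one big batch. A toy sketch with a one-parameter model y = w·x and simulated GPUs; the model, data, and function names are all illustrative assumptions:

```python
def grad(w, xs, ys):
    """Gradient of mean squared error for the 1-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, n_gpus):
    """Every 'GPU' holds the same w but sees only its own shard of the batch."""
    shard = len(xs) // n_gpus  # assumes the batch divides evenly
    local_grads = [
        grad(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_gpus)
    ]
    # On a real cluster this sum is the all-reduce; dividing by n_gpus averages it.
    return sum(local_grads) / n_gpus


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x for x in xs]
print(grad(0.5, xs, ys))                   # -76.5
print(data_parallel_grad(0.5, xs, ys, 4))  # -76.5
```

Tensor and pipeline parallelism are not shown here; they split the model's own weights or layers instead of the batch.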
“I am Storage / Power / Cooling.” Storage = high-throughput parallel access (parallel FS, NVMe) so GPUs aren't starved. Power/cooling = GPU racks are kW-dense → high-density power + often liquid cooling. On-prem vs cloud: steady high use + data residency → on-prem/sovereign; bursty → cloud; mix = hybrid. DGX/HGX = integrated multi-GPU systems.
Big picture: GPUs are expensive → ops keeps them busy, healthy, shared, orchestrated. Three jobs: watch (monitor), orchestrate (schedule), slice (share). Low utilization usually = a feeding problem (data/network), not too little compute.
“I am NVIDIA-SMI vs DCGM.” SMI = quick per-GPU CLI status. DCGM = cluster-grade health + telemetry (utilization, memory, temp, power, ECC errors, throttling) → feeds Prometheus/Grafana.
“I am the GPU Operator.” On Kubernetes I auto-deploy the driver, container toolkit, device plugin & DCGM so K8s can schedule GPU pods. Beside me: Slurm (HPC batch scheduler), Run:ai / Base Command (GPU scheduling).
“I am MIG vs vGPU.” MIG = hardware-partition one GPU into isolated instances (own memory+compute) → predictable QoS, great for many small inference jobs / multi-tenancy. vGPU / time-slicing = softer software sharing across VMs. (NVLink/NVSwitch do the opposite — combine GPUs.)
“I am Triton.” I serve optimized models in production (multi-framework, dynamic batching). TensorRT tuned them first; NIM packages them as microservices.
Reliability & efficiency: checkpointing saves model/optimizer state periodically so a long multi-node job resumes after a failure instead of restarting from zero. Low GPU utilization = wasted money → fix with batching (Triton), MIG, consolidation, autoscaling before buying more hardware.
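The checkpoint-and-resume idea can be sketched with a toy training loop. Everything here is a stand-in: the JSON file, the `crash_at` parameter, and the "+1.0" update are hypothetical, and a real job would save model and optimizer state through its framework.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weights):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "weights": weights}, f)
    os.replace(tmp, path)  # rename into place so a crash never leaves a half-written file


def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, [0.0]  # no checkpoint yet: start from zero
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weights"]


def train(path, total_steps, every, crash_at=None):
    step, weights = load_checkpoint(path)  # resume if a checkpoint exists
    while step < total_steps:
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated node failure")
        weights = [w + 1.0 for w in weights]  # stand-in for one optimizer update
        step += 1
        if step % every == 0:
            save_checkpoint(path, step, weights)
    return step, weights


path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(path, total_steps=10, every=3, crash_at=7)
except RuntimeError:
    pass
resumed = load_checkpoint(path)
print(resumed)                     # (6, [6.0]): only step 6 -> 7 is lost
final = train(path, total_steps=10, every=3)
print(final)                       # (10, [10.0])
```

The checkpoint interval is the trade-off: saving more often loses less work on failure but spends more time writing to storage.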
This is largely a data-sheet memorization exam — questions give you numbers and you deduce the part. Learn the H100 & DGX H100 sheets.
H100 SXM: 80 GB HBM3 · 700 W TDP · 16,896 CUDA cores · 900 GB/s NVLink (Gen4) · air- or liquid-cooled depending on the system (DGX H100 is air-cooled). H100 NVL: 188 GB total (2 × 94 GB, a bridged pair of GPUs), 350–400 W per GPU. Cue: 900 GB/s or 700 W → SXM; 188 GB → NVL.
DGX H100: 8× H100 · 640 GB HBM3 · 32 PFLOPS FP8 · 4× NVSwitch (7.2 TB/s) · 10× ConnectX-7 @ 400 Gb/s · dual Xeon Platinum 8480C (112 cores) · 2 TB RAM · 30 TB NVMe. SuperPOD (32 nodes) ≈ 1 ExaFLOP FP8.
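Several of the DGX H100 figures follow from each other, which makes them easier to remember. A quick check using only the numbers quoted above (the variable names are illustrative):

```python
GPUS_PER_NODE = 8
HBM_PER_GPU_GB = 80      # H100 SXM
NODE_FP8_PFLOPS = 32     # DGX H100
SUPERPOD_NODES = 32

node_hbm_gb = GPUS_PER_NODE * HBM_PER_GPU_GB             # total GPU memory per node
per_gpu_fp8_pflops = NODE_FP8_PFLOPS / GPUS_PER_NODE     # FP8 share per GPU
superpod_fp8_pflops = SUPERPOD_NODES * NODE_FP8_PFLOPS   # 32 nodes together
superpod_exaflops = superpod_fp8_pflops / 1000           # 1 ExaFLOP = 1000 PFLOPS

print(node_hbm_gb)          # 640
print(per_gpu_fp8_pflops)   # 4.0
print(superpod_exaflops)    # 1.024
```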
NVLink (primarily intra-node): V100 300 → A100 600 → H100 900 GB/s (18 links) → Blackwell ~1.8 TB/s. Quoted figures are bidirectional; halve for one direction. NVLink-C2C is the chip-to-chip variant: CPU↔GPU in Grace Hopper, CPU↔CPU in the Grace CPU Superchip.
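The "halve for one direction" rule and the per-link figure can be worked out from the quoted totals. A small sketch; the dictionary and function names are illustrative:

```python
# Quoted (bidirectional) NVLink bandwidth per GPU, in GB/s.
NVLINK_GB_S = {"V100": 300, "A100": 600, "H100": 900}


def per_direction(bidirectional_gb_s):
    """Bandwidth available in a single direction."""
    return bidirectional_gb_s / 2


def per_link(total_gb_s, n_links):
    """Bidirectional bandwidth carried by each individual link."""
    return total_gb_s / n_links


print(per_direction(NVLINK_GB_S["H100"]))   # 450.0
print(per_link(NVLINK_GB_S["H100"], 18))    # 50.0
```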
InfiniBand: …HDR (200) → NDR (400 Gb/s) → XDR; H100 era = NDR on Quantum-2. Trap: NDR400 = 4 lanes; NDR200 = 2 lanes (half bandwidth).
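The lane trap is easier to see as arithmetic: 400 Gb/s over 4 lanes implies 100 Gb/s per lane. Note the unit change too, since InfiniBand is quoted in gigabits while NVLink is quoted in gigabytes. A sketch with illustrative names:

```python
NDR_GBIT_PER_LANE = 100  # implied by NDR400 = 4 lanes


def port_speed_gbit(lanes):
    """Port speed in Gb/s for an NDR port with the given lane count."""
    return lanes * NDR_GBIT_PER_LANE


def gbit_to_gbyte(gbit_per_s):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


print(port_speed_gbit(4), gbit_to_gbyte(port_speed_gbit(4)))   # 400 50.0
print(port_speed_gbit(2), gbit_to_gbyte(port_speed_gbit(2)))   # 200 25.0
```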
Latency ladder: InfiniBand+RDMA ~1–2 µs · through the CPU ~10–30 µs · plain TCP/IP Ethernet (no RDMA) ~20–50 µs. Fast path = RDMA, bypass the CPU.
NCCL collectives: All-Reduce ⭐ (reduce + result in all ranks → gradient sync) · Reduce (one rank) · Broadcast (root→all) · All-Gather · Reduce-Scatter · Gather · Scatter · All-to-All. Most questions hit all-reduce.
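To make the collectives concrete, here is a pure-Python simulation in which each inner list stands for one rank's buffer. This only mimics the semantics; it is not the NCCL API, and the function names are illustrative.

```python
def _elementwise_sum(ranks):
    # Add position 0 from every rank, then position 1, and so on.
    return [sum(values) for values in zip(*ranks)]


def all_reduce(ranks):
    """Every rank ends up with the full element-wise sum."""
    total = _elementwise_sum(ranks)
    return [list(total) for _ in ranks]


def broadcast(ranks, root=0):
    """Every rank ends up with a copy of the root's buffer."""
    return [list(ranks[root]) for _ in ranks]


def all_gather(ranks):
    """Every rank ends up with all buffers concatenated in rank order."""
    gathered = [x for rank in ranks for x in rank]
    return [list(gathered) for _ in ranks]


def reduce_scatter(ranks):
    """Sum element-wise, then give each rank one equal slice of the result."""
    total = _elementwise_sum(ranks)
    chunk = len(total) // len(ranks)
    return [total[i * chunk:(i + 1) * chunk] for i in range(len(ranks))]


grads = [[1, 2], [3, 4]]               # two ranks, two gradient values each
print(all_reduce(grads))               # [[4, 6], [4, 6]]
print(broadcast(grads))                # [[1, 2], [1, 2]]
print(all_gather(grads))               # [[1, 2, 3, 4], [1, 2, 3, 4]]
print(reduce_scatter(grads))           # [[4], [6]]
```

A useful way to remember the relationships: a reduce-scatter followed by an all-gather produces the same result as an all-reduce.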
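Power & cooling: TDP = heat to dissipate; 700 W parts often need liquid/direct-to-chip cooling. Trap: liquid cooling removes heat more efficiently but does not lower the GPU's power draw. PUE = total facility power ÷ IT power (ideal 1.0, good ≤1.2). Rack math: 64×700 W = 44.8 kW ÷ ~15 kW/rack ≈ 2.99 → round up to 3 racks (GPU power alone; CPUs, NICs, and fans add more).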
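The PUE and rack calculations can be checked in a few lines. This sketch counts GPU power alone, so the division gives about 2.99 and rounds up to 3; budgeting for CPUs, NICs, and fans (not modeled here) would push the count higher. The function names and the 1,200 kW / 1,000 kW PUE example are illustrative assumptions.

```python
import math


def pue(facility_kw, it_kw):
    """Power Usage Effectiveness: total facility power divided by IT power."""
    return facility_kw / it_kw


def racks_needed(n_gpus, watts_per_gpu, rack_budget_kw):
    """Return (GPU load in kW, racks required), always rounding racks up."""
    load_kw = n_gpus * watts_per_gpu / 1000
    return load_kw, math.ceil(load_kw / rack_budget_kw)


load_kw, racks = racks_needed(64, 700, 15)
print(load_kw, racks)     # 44.8 3
print(pue(1200, 1000))    # 1.2
```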
Precision/memory: Hopper Transformer Engine = FP8 + FP16; quantize LLMs to FP8 for H100. Inference memory ≈ weights; training ≈ 12–16× that, since gradients and optimizer state sit alongside the weights (a 70B model can run out of memory on 8× H100).
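A back-of-envelope check of the 70B example. Reading the training multiplier as roughly 12 to 16 bytes per parameter (that is, relative to 1-byte FP8 weights) is an assumption made for this sketch, and activations are ignored entirely; the function names are illustrative.

```python
NODE_HBM_GB = 8 * 80  # 8x H100 with 80 GB each = 640 GB


def inference_gb(n_params, bytes_per_weight):
    """Weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_weight / 1e9


def training_gb(n_params, bytes_per_param):
    """Weights + gradients + optimizer state, under the assumed bytes per parameter."""
    return n_params * bytes_per_param / 1e9


n = 70e9
print(inference_gb(n, 1))                         # 70.0   (FP8 weights)
print(training_gb(n, 12), training_gb(n, 16))     # 840.0 1120.0
print(training_gb(n, 12) > NODE_HBM_GB)           # True: exceeds one node
```

Even at the low end of the assumed range, the footprint exceeds the 640 GB on a single node, which is when sharding across more GPUs comes in.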
Models/formats: ResNet-50 = image-classification CNN; ResNet/DenseNet mitigate vanishing gradients with skip (residual/dense) connections; ONNX = framework-agnostic model format (served by Triton). FHE (fully homomorphic encryption) / NVIDIA ARX = compute on encrypted data in an untrusted data center.