NVIDIA DGX A100 User Manual

DGX A100 System
DU-09821-001_v01 | May 2020
User Guide

TABLE OF CONTENTS
Chapter 1. Introduction
1.1 Hardware Overview
1.1.1 Component Specifications
1.1.2 Mechanical Specifications
1.1.3 Power Specifications
1.1.4 Environmental Specifications
1.1.5 Front Panel Connections and Controls
1.1.6 Rear Panel Modules
1.1.7 Motherboard Connections and Controls
1.1.8 Motherboard Tray Components
1.1.9 GPU Tray Components
1.2 Network Connections, Cables, and Adaptors
1.2.1 Network Ports
1.2.2 Supported Network Cables
1.2.3 Supported Network Adaptors
1.3 DGX OS Software
1.4 Additional Documentation
1.5 Customer Support
1.5.1 NVIDIA Enterprise Support Portal
1.5.2 NVIDIA Enterprise Support Email
1.5.3 NVIDIA Enterprise Support - Local Time Zone Phone Numbers
Chapter 2. Connecting to the DGX A100
2.1 Connecting to the Console
2.1.1 Direct Connection
2.1.2 Remote Connection through the BMC
2.2 SSH Connection to the OS
Chapter 3. First-Boot Setup
Chapter 4. Quick Start and Basic Operation
4.1 Installation and Configuration
4.2 Registration
4.3 Obtaining an NGC Account
4.4 Turning DGX A100 On and Off
4.4.1 Startup Considerations
4.4.2 Shutdown Considerations
4.5 Verifying Functionality
4.5.1 Quick Health Check
4.6 Running NGC Containers with GPU Support
4.6.1 Using Native GPU Support
4.6.2 Using the NVIDIA Container Runtime for Docker
4.7 Managing CPU Mitigations
4.7.1 Determining the CPU Mitigation State of the DGX System
4.7.2 Disabling CPU Mitigations
4.7.3 Re-enabling CPU Mitigations
Chapter 5. Additional Features and Instructions
5.1 Managing the DGX Crash Dump Feature
5.1.1 Using the Script
5.1.2 Connecting to Serial Over LAN
Chapter 6. Managing the DGX A100 Self-Encrypting Drives
6.1 Overview
6.2 Installing the Software
6.3 Initializing the System for Drive Encryption
6.4 Enabling Drive Locking
6.5 Initialization Examples
6.5.1 Example 1: Passing in the JSON File
6.5.2 Example 2: Generating Random Passwords
6.5.3 Example 3: Specifying Passwords One at a Time When Prompted
6.6 Disabling Drive Locking
6.7 Exporting the Vault
6.8 Erasing your Data
6.9 Using the Trusted Platform Module
6.9.1 Enabling the TPM
6.9.2 Clearing the TPM
6.10 Changing Disk Passwords, Adding Disks, or Replacing Disks
6.11 Hot Removal and Re-Insertion
6.12 Recovering From Lost Keys
Chapter 7. Network Configuration
7.1 Configuring Network Proxies
7.1.1 For the OS and Most Applications
7.1.2 For apt
7.1.3 For Docker
7.1.4 Configuring Docker IP Addresses
7.1.5 Opening Ports
7.2 Connectivity Requirements for NGC Containers
7.3 Configuring Static IP Address for the BMC
7.3.1 Configuring a BMC Static IP Address Using ipmitool
7.3.2 Configuring a BMC Static IP Address Using the System BIOS
7.4 Configuring Static IP Addresses for the Network Ports
7.5 Switching Between InfiniBand and Ethernet
7.5.1 Starting the Mellanox Software Tools
7.5.2 Determining the Current Port Configuration
7.5.3 Switching the Port Configuration
Chapter 8. Configuring Storage
Chapter 9. Updating and Restoring the Software
9.1 Updating the DGX A100 Software
9.1.1 Connectivity Requirements For Software Updates
9.1.2 Update Instructions
9.2 Restoring the DGX A100 Software Image
9.2.1 Obtaining the DGX A100 Software ISO Image and Checksum File
9.2.2 Re-Imaging the System Remotely
9.2.3 Creating a Bootable Installation Medium
9.2.4 Creating a Bootable USB Flash Drive by Using the dd Command
9.2.5 Creating a Bootable USB Flash Drive by Using Akeo Rufus
9.2.6 Re-Imaging the System From a USB Flash Drive
9.2.7 Retaining the RAID Partition While Installing the OS
Chapter 10. Using the BMC
10.1 Connecting to the BMC
10.2 Overview of BMC Controls
10.3 Common BMC Tasks
10.3.1 Changing BMC Login Credentials
10.3.2 Using the Remote Console
10.3.3 Setting Up Active Directory or LDAP/E-Directory
10.3.4 Configuring Platform Event Filters
10.3.5 Uploading or Generating SSL Certificates
Chapter 11. Multi-Instance GPU
11.1 Enabling MIG on the DGX A100 System
11.2 Viewing Available Profiles
11.2.1 Viewing GPU Profiles
11.2.2 Viewing Compute Profiles
11.3 Creating MIG Instances
11.3.1 Creating a GPU instance
11.3.2 Creating a Compute Instance
11.4 Using MIG with Docker Containers
11.5 Deleting MIG Instances
11.5.1 MIG Instance Deletion Process
11.5.2 MIG Instance Deletion Examples
Chapter 12. Security
12.1 User Security Measures
12.1.1 Securing the BMC Port
12.2 System Security Measures
12.2.1 Secure Flash of DGX A100 Firmware
12.2.2 NVSM Security
12.3 Secure Data Deletion
12.3.1 Prerequisite
12.3.2 Instructions
Appendix A. Installing Software on Air-gapped DGX A100 Systems
A.1 Installing NVIDIA DGX A100 Software
A.2 Re-Imaging the System
A.3 Creating a Local Mirror of the NVIDIA and Canonical Repositories
A.3.1 Create Mirrors
A.3.2 Configure the Target System
A.4 Installing Docker Containers
Appendix B. Safety
B.1 Safety Information
B.2 Safety Warnings and Cautions
B.3 Intended Application Uses
B.4 Site Selection
B.5 Equipment Handling Practices
B.6 Electrical Precautions
B.7 System Access Warnings
B.8 Rack Mount Warnings
B.9 Electrostatic Discharge (ESD)
B.10 Other Hazards
Appendix C. Compliance
C.1 United States
C.2 United States / Canada
C.3 Canada
C.4 CE
C.5 Australia and New Zealand
C.6 Brazil
C.7 Japan
C.8 South Korea
C.9 China
C.10 Taiwan
C.11 Russia/Kazakhstan/Belarus

CHAPTER 1 INTRODUCTION
The NVIDIA DGX™ A100 system is the universal system purpose-built for all AI
infrastructure and workloads, from analytics to training to inference. The system is built
on eight NVIDIA A100 Tensor Core GPUs.
This document is for users and administrators of the DGX A100 system.

1.1 HARDWARE OVERVIEW
1.1.1 Component Specifications
Component | Qty | Description
GPU | 8 | NVIDIA A100 GPU, 40 GB memory per GPU
CPU | 2 | 2x AMD EPYC 7742 CPU with 64 cores each
NVLink | 12 | 600 GB/s GPU-to-GPU bandwidth
Storage (OS) | 2 | 1.92 TB NVMe M.2 SSD (each) in RAID 1 array
Storage (Data Cache) | 4 (base config); 8 (optional, with 4 additional drives installed) | 3.84 TB NVMe U.2 SED (each) in RAID 0 array
Network (cluster) card | 8 | Mellanox ConnectX-6 Single Port VPI, 200 Gb InfiniBand (default)/Ethernet
Network (storage) card | 1 (base config); 2 (optional, with additional I/O card installed) | Mellanox ConnectX-6 Dual Port VPI, 200 Gb Ethernet (default)/InfiniBand
System Memory (DIMM) | 16 (base config); 32 (optional, with 16 additional DIMMs installed) | 1 TB total memory in base configuration; 2 TB total memory in optional configuration
BMC (out-of-band system management) | 1 | 1 GbE RJ45 interface; supports IPMI, SNMP, KVM, HTTPS
In-band system management | 1 | 1 GbE RJ45 interface
Power Supply | 6 | 3 kW

1.1.2 Mechanical Specifications
Feature | Description
Form Factor | 6U Rackmount
Height | 10.39" (264 mm)
Width | 19" (482.3 mm)
Depth | 35.32" (897.2 mm)
System Weight | 271 lbs (123 kg)
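
The quantities in the component and mechanical tables imply a few aggregate figures that the tables do not state directly. The following Python sketch derives them under stated assumptions: capacities are raw (no file system or RAID metadata overhead), 1 TB of system memory is counted as 1024 GB, and one rack unit is taken as 44.45 mm. The variable and function names are illustrative only and do not come from the guide.

```python
# Figures copied from the component and mechanical tables above.
GPU_COUNT, GPU_MEMORY_GB = 8, 40
OS_DRIVE_TB = 1.92            # two drives in a RAID 1 array
CACHE_DRIVE_TB = 3.84         # four (base) or eight (optional) drives in RAID 0
DIMMS_BASE = 16
MEMORY_BASE_GB = 1024         # assumption: 1 TB counted as 1024 GB
HEIGHT_MM = 264
RACK_UNITS = 6
RACK_UNIT_MM = 44.45          # assumption: one rack unit is 1.75 in = 44.45 mm


def raid0_capacity_tb(drives, drive_tb):
    """RAID 0 stripes across members: raw capacity is the sum of all drives."""
    return drives * drive_tb


def raid1_capacity_tb(drive_tb):
    """RAID 1 mirrors members: usable capacity is that of a single drive."""
    return drive_tb


total_gpu_memory_gb = GPU_COUNT * GPU_MEMORY_GB
os_usable_tb = raid1_capacity_tb(OS_DRIVE_TB)
cache_base_tb = raid0_capacity_tb(4, CACHE_DRIVE_TB)
cache_optional_tb = raid0_capacity_tb(8, CACHE_DRIVE_TB)
dimm_size_gb = MEMORY_BASE_GB // DIMMS_BASE
fits_in_rack = HEIGHT_MM <= RACK_UNITS * RACK_UNIT_MM

print(f"Total GPU memory:       {total_gpu_memory_gb} GB")
print(f"OS storage (RAID 1):    {os_usable_tb:.2f} TB usable")
print(f"Data cache (base):      {cache_base_tb:.2f} TB raw")
print(f"Data cache (optional):  {cache_optional_tb:.2f} TB raw")
print(f"Implied DIMM size:      {dimm_size_gb} GB")
print(f"Fits in {RACK_UNITS}U of rack space: {fits_in_rack}")
```

The RAID 0 data cache trades redundancy for capacity and throughput, which is why its raw size scales with the drive count, while the mirrored OS array stays at the size of one drive.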

1.1.3 Power Specifications
Input | Specification for Each Power Supply | Comments
200-240 volts AC; 6.5 kW max. | 3000 W @ 200-240 V, 16 A, 50-60 Hz | The DGX A100 system contains six load-balancing power supplies.

Support for N+N Redundancy
The DGX A100 includes six power supply units (PSUs) configured for 3+3 redundancy. If
three PSUs fail, the system will continue to operate at full power with the remaining
three PSUs.
Note: The DGX A100 will not operate with fewer than three PSUs.

DGX A100 Locking Power Cords
The DGX A100 is shipped with a set of six (6) locking power cords that have been
qualified for use with the DGX A100 to ensure regulatory compliance.
WARNING: To avoid electric shock or fire, do not connect other power cords to
the DGX A100. For more details, see “Electrical Precautions”.

Power Cord Feature | Specification
Electrical | 250 VAC, 16 A
Plug Standard | C19/C20
Dimension | 1800 mm length
Compliance | Cord: UL62, IEC60227; Connector/Plug: IEC60320-1

Follow these instructions for using the locking power cords.

Power Distribution Unit side
To INSERT, push the cable into the PDU socket.
To REMOVE, press the clips together and pull the cord out of the socket.
To UNLOCK the power cord, move the switch to the unlocked position (indicator will show GREEN).
To LOCK the power cord, move the switch to the locked position (indicator should show only RED).
Power Supply (System) side
To INSERT or REMOVE, make sure the cable is UNLOCKED and push/pull into/out of the socket.
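
The 3+3 redundancy stated in the power specifications can be checked with simple arithmetic. The Python sketch below is a simplified model, not NVIDIA's sizing method: it assumes every working PSU can deliver its full 3 kW rating and ignores load balancing and input-voltage effects. The function names are hypothetical.

```python
# Figures copied from the power specifications in this section.
PSU_COUNT = 6
PSU_RATED_KW = 3.0          # 3000 W per power supply
SYSTEM_MAX_KW = 6.5         # maximum system input power
MIN_PSUS_TO_OPERATE = 3     # the system will not operate with fewer


def available_kw(working_psus):
    """Total rated output of the PSUs that are still working (assumed model)."""
    return working_psus * PSU_RATED_KW


def can_run_at_full_power(working_psus):
    """True if the remaining PSUs meet the minimum count and cover 6.5 kW."""
    return (working_psus >= MIN_PSUS_TO_OPERATE
            and available_kw(working_psus) >= SYSTEM_MAX_KW)


for failed in range(PSU_COUNT + 1):
    working = PSU_COUNT - failed
    status = "full power" if can_run_at_full_power(working) else "not operational"
    print(f"{failed} failed, {working} working, "
          f"{available_kw(working):.1f} kW available: {status}")
```

Under these assumptions, three working PSUs supply 9 kW against a 6.5 kW maximum, while two would supply only 6 kW, which is consistent with the three-PSU minimum noted in this section.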