Huawei Atlas 900: the newly launched world’s fastest AI training cluster breaks the computing record with a 10-second lead
By Town Kassis 2019-09-18
Introduction to the Atlas 900 AI training cluster
Neural networks trained on large datasets are used for image recognition, natural language processing, real-time video analysis, and intelligent recommendation systems. Training these models requires a great deal of floating-point computing power.
In recent years, the computing power of a single AI processor and the available training methods have advanced significantly, but training on a single machine still takes an impractical amount of time. A large-scale distributed AI cluster is therefore needed to raise the floating-point computing power available to a neural network training system.
The Atlas 900 AI training cluster, made up of thousands of Ascend 910 AI processors, is the fastest AI training cluster in the world and represents the pinnacle of computing power in the world today.
Its total computing power is 256P to 1024P FLOPS at FP16, equivalent to the computing power of 500,000 PCs.
Leading technology advantages of the Atlas 900 AI training cluster
The industry-leading AI computing
The Atlas 900 AI training cluster uses the Ascend 910 AI processor, which Huawei describes as the industry's most powerful single chip. Each chip has 32 built-in Da Vinci AI Cores and delivers 256 TFLOPS at FP16, twice the computing power of comparable chips in the industry. The Atlas 900 AI training cluster interconnects thousands of Ascend 910 AI processors to create the industry's leading computing cluster.
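The cluster-level range quoted earlier (256P to 1024P FLOPS at FP16) can be related to the per-chip figure with simple arithmetic. The sketch below is only an illustration: it assumes that "P" is read as 1024 "T", and the chip counts of 2048 and 4096 are hypothetical, since the article says only "thousands". The result is a theoretical peak, not measured training throughput.

```python
# Per-chip FP16 peak quoted in the article for the Ascend 910.
CHIP_TFLOPS_FP16 = 256

# Assumption: "P" is treated as 1024 "T" (binary prefix), which is what
# makes 1024 chips come out to exactly 256P.
TFLOPS_PER_PFLOPS = 1024


def cluster_pflops(num_chips: int) -> float:
    """Aggregate theoretical peak FP16 compute of num_chips processors, in PFLOPS."""
    return num_chips * CHIP_TFLOPS_FP16 / TFLOPS_PER_PFLOPS


if __name__ == "__main__":
    # 1024 matches the Huawei cloud deployment mentioned in the article;
    # the larger counts are illustrative only.
    for chips in (1024, 2048, 4096):
        print(f"{chips} chips -> {cluster_pflops(chips):.0f} PFLOPS @ FP16")
```

Under these assumptions, 1024 chips correspond to the low end of the quoted range and 4096 chips to the high end.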
The Ascend 910 AI processor adopts an SoC design that integrates AI computing power, general-purpose computing power, and high-speed, high-bandwidth I/O. This offloads much of the data preprocessing work from the host CPU and improves training efficiency.
Optimal cluster network
The Atlas 900 AI training cluster adopts three high-speed interconnects: HCCS, PCIe 4.0, and 100G Ethernet. The 100 TB fully interconnected, non-blocking, dedicated parameter synchronization network lowers network latency and reduces gradient synchronization delay by 10% to 70%.
Within the AI server, the Ascend 910 AI processors are interconnected through the HCCS high-speed bus. The Ascend 910 AI processor and the CPU are connected over PCIe 4.0 (16 GT/s per lane), which is twice as fast as PCIe 3.0 (8 GT/s per lane) and makes data transfer faster and more efficient. At the cluster level, CloudEngine 8600 series data center switches provide a single-port switching rate of 100 Gbps and connect all AI servers in the cluster to a high-speed switching network.
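To put the per-lane PCIe rates above (16 for PCIe 4.0 versus 8 for PCIe 3.0) into bytes per second, the following sketch estimates one-direction link bandwidth. It relies on the 128b/130b line encoding that both generations use. The x16 link width is an assumption, as the article does not state it, and packet-level protocol overhead is ignored.

```python
def pcie_bandwidth_gbytes(rate_gt_per_s: float, lanes: int) -> float:
    """Approximate one-direction bandwidth in GB/s for a PCIe 3.0/4.0 link.

    Both generations use 128b/130b line encoding, so 128 of every
    130 transferred bits carry payload. Packet overhead is not modelled.
    """
    payload_gbits = rate_gt_per_s * lanes * 128 / 130
    return payload_gbits / 8  # bits -> bytes


if __name__ == "__main__":
    LANES = 16  # assumption: an x16 link; the article does not give the width
    gen3 = pcie_bandwidth_gbytes(8.0, LANES)
    gen4 = pcie_bandwidth_gbytes(16.0, LANES)
    print(f"PCIe 3.0 x{LANES}: {gen3:.2f} GB/s")
    print(f"PCIe 4.0 x{LANES}: {gen4:.2f} GB/s")
    print(f"ratio: {gen4 / gen3:.1f}x")
```

Because the encoding is the same in both generations, the doubling of the per-lane rate carries through directly to usable bandwidth.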
Huawei's own iLossless intelligent lossless switching algorithm learns from the cluster's network traffic in real time, achieving zero packet loss and microsecond-level end-to-end latency.
System level tuning
Through Huawei's collective communication library and job scheduling platform, the Atlas 900 AI training cluster integrates three high-speed interfaces (HCCS, PCIe 4.0, and 100G RoCE) to fully release the performance of the Ascend 910 AI processor.
Huawei's collective communication library provides the distributed parallel library needed to train networks. The communication library, network topology, and training algorithm are optimized together at the system level to achieve cluster linearity above 60%, which greatly improves the efficiency of job scheduling.
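To make "collective communication" for gradient synchronization more concrete, here is a minimal single-process simulation of ring all-reduce, a well-known collective in which every worker ends up with the sum of all workers' gradients. The article does not say which algorithm Huawei's library uses, so this is a generic illustration and not Huawei's implementation. The function name `ring_allreduce` and the one-value-per-chunk simplification are assumptions made for the example.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Sum gradient vectors across workers arranged in a ring (simulated).

    grads[i] is worker i's local gradient. For simplicity the vector
    length must equal the number of workers, so each chunk is one value.
    """
    n = len(grads)
    assert all(len(g) == n for g in grads), "one chunk per worker expected"
    data = [list(g) for g in grads]  # copy so the inputs stay untouched

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot every message first so all sends happen "simultaneously".
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data


if __name__ == "__main__":
    local = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    for worker, result in enumerate(ring_allreduce(local)):
        print(f"worker {worker}: {result}")
```

Each worker only ever talks to its ring neighbour, and the data sent per worker stays roughly constant as workers are added. That property is one reason ring-style collectives are popular for gradient synchronization.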
Extreme heat dissipation system
Traditional data centers mostly rely on air cooling to dissipate equipment heat, but in the era of artificial intelligence this approach faces a major challenge. High-power devices such as CPUs and AI chips create a stronger heat island effect and require more efficient cooling. Liquid cooling can meet the demanding requirements of high power, high-density deployment, and low PUE in the data center.
The Atlas 900 AI training cluster adopts a full liquid cooling scheme with an innovative cabinet-level airtight insulation design, which Huawei describes as the strongest in the industry, supporting a liquid cooling ratio above 95%. A single cabinet supports heat dissipation of up to 50 kW, achieving extreme data center energy efficiency with PUE < 1.1.
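PUE (power usage effectiveness) is the ratio of total facility power to IT equipment power, so a PUE limit translates directly into a budget for cooling and other overhead. The sketch below works this out for one cabinet. Treating the 50 kW heat dissipation figure as the cabinet's IT load is an assumption, and the 54 kW facility draw is a hypothetical number used only for illustration.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_kw / it_equipment_kw


def max_overhead_kw(it_equipment_kw: float, pue_limit: float) -> float:
    """Largest non-IT power (cooling, power delivery, etc.) that stays within pue_limit."""
    return it_equipment_kw * (pue_limit - 1.0)


if __name__ == "__main__":
    cabinet_kw = 50.0  # assumption: the 50 kW per-cabinet figure is the IT load
    print(f"overhead budget at PUE 1.1: {max_overhead_kw(cabinet_kw, 1.1):.1f} kW")
    # Hypothetical total facility draw for this cabinet, for illustration only.
    print(f"PUE at 54 kW total: {pue(54.0, cabinet_kw):.2f}")
```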
The design also saves space: compared with 8 kW air-cooled cabinets, it reduces the required computer room space by 79%. Liquid cooling meets the needs of high-power, high-density equipment deployment and low PUE, and greatly reduces customers' TCO.
Leading benchmark results of the Atlas 900 AI training cluster
Huawei has deployed an Atlas 900 AI training cluster with 1024 Ascend 910 AI processors on Huawei Cloud. Using the widely used ResNet-50 v1.5 model and the ImageNet-1k dataset, the Atlas 900 AI training cluster takes only 59.8 seconds to complete training, ranking first in the world.
The ImageNet-1k dataset contains 1.28 million images, and the training run reaches an accuracy of 75.9%. At the same accuracy, the results from the other two mainstream vendors in the industry are 70.2 s and 76.8 s respectively, so the Atlas 900 AI training cluster is about 15% faster than the runner-up.
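The "10 s lead" in the headline and the "15% faster" figure can both be checked against the three training times quoted above. The short sketch below computes the lead in seconds, the reduction in wall-clock time, and the speedup ratio. The helper names are illustrative, and only the three times come from the article.

```python
def time_saved_pct(fast_s: float, slow_s: float) -> float:
    """Percentage reduction in wall-clock time of fast_s relative to slow_s."""
    return (slow_s - fast_s) / slow_s * 100.0


def speedup(fast_s: float, slow_s: float) -> float:
    """Ratio of wall-clock times: how many times faster the fast run is."""
    return slow_s / fast_s


if __name__ == "__main__":
    atlas_s = 59.8            # Atlas 900 result from the article
    others_s = [70.2, 76.8]   # the two other vendors' results from the article
    for other in others_s:
        print(f"vs {other:.1f} s: lead {other - atlas_s:.1f} s, "
              f"{time_saved_pct(atlas_s, other):.1f}% less time, "
              f"{speedup(atlas_s, other):.2f}x")
```

Measured as time saved against the 70.2 s result, the gap is a little over 10 seconds, or just under 15%. Expressed as a speedup ratio it is slightly higher, so the "15%" figure depends on which definition is used.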
Applicable scenarios for the Atlas 900 AI cluster
The Atlas 900 AI cluster mainly provides massive computing power for training neural networks on large datasets. It can be widely used in scientific research and business innovation, allowing researchers to train AI models for images, video, and voice more quickly. It helps people explore the mysteries of the universe, predict the weather, and explore for oil more efficiently, and it can accelerate the commercialization of autonomous driving.
The Atlas 900 AI cluster can also be offered as a cloud service, providing abundant, economical computing resources together with an easy-to-use, efficient, full-process AI platform. This gives customers inclusive AI computing power that is easy to obtain, affordable, and easy to use.