xCAT is used to manage the servers in Summit and Sierra, two supercomputers taking the #1 and #2 spots on the Nov 2018 Top 500.
Extreme scalable with hierarchy architecture
To manage and provision thousands of bare-metal servers in supercomputing data center, a scalable architecture is mandatory. xCAT supports hierarchy architecture with multiple Service nodes, Compute nodes are partitioned and managed by those Service Nodes.
Simple and fast provisioning with diskless mode
In most of HPC sites, Compute Nodes are expected to be stateless. xCAT supports
diskless installation, which enables the great simplicity and flexibility in managing both the software stack ( Including out-of-box Nvidia CUDA and Mellonox OFED ) on Compute Nodes and its deployment lifecycle.
End-2-end infrastructure discovery
The hierarchy topology is complicated and takes administrator too much effort to deploy on Day-0. xCAT supports rich set of
discovery capabilities and
ONIE switch provisioning, it simplifies the process by iterating over the following steps: connect, describe, then discover.
Zero-touch failed server replacement
Inevitably, there will be some failed servers detected. It is a simple task in xCAT to replace the fault servers. Just replace the failed server with the new one and power it on. After that, xCAT will take care of the node provisioning and information refreshing for you.
Deep Learning requires lots of computational resources to process analytics on large amounts of data, xCAT could be used to manage and deploy the Deep Learning environment.
Manage Deep Learning Elements
The cognitive computing environment requires to deploy servers with GPU and install deep learning frameworks and libraries. It is a burden for data scientist to manage such an environment as there are lots of dependent pieces from different sources, like RPM, Conda and Python, etc. xCAT can help you to mirror and configure those repositories for all of the dependent pieces, plus the NVIDIA hardware drivers, CUDA (parallel computing platform API) Toolkit, and NCCL (Collective Communications Library). After that, you could have an offline central repository serving for the whole deep learning cluster. In addition, integrated with
PowerAI, xCAT could support the enterprise grade deep learning solution based on IBM® Power Systems™ servers.
Simple deployment of deep learning environment
With the powerful bare-metal provisioning and flexible
diskful osimage definition, xCAT lets you deploy the deep learning clusters in minutes. Everything is automated, spend your time in developing instead of deploying. With
xcat-inventory, you can source control your environments into Git repository. And it is possible for you to take risks and try new things in the testing and agile development without worrying about the recovery.
Scalability
Although deep learning environment is not so large today, a single management node is enough. But xCAT still lets you scale it beyond a single server, and quickly scale to a whole cluster.
Besides HPC cluster used for production, many HPC customers are still requiring a development environment on-premise for testing and agile developing. Virtualization environment is often used for such case as setting up a bare-metal cluster takes considerable time and effort.
Deployment of Virtualization infrastructure
xCAT supports deployment of different kinds of
virtualization infrastructures: Redhat RHV(KVM), IBM PowerKVM and Vmware ESXi. You can easily deploy those hypervisors on bare metal servers, and create virtual machine instances with xCAT. In addition, you can provision those virtual machines in the same way as xCAT provisions physical machines.
On-demand Elastic Scaling
xCAT supports re-purposing of unused HPC servers into virtualization environment with fast re-provisioning, and move it back again when HPC workloads require on schedule. This improves the resource utilization and offers the underlying infrastructure software defined capability.
RESTful API
And xCAT supports
RESTful APIs, to help with development of your own self-service portal.