How Fujitsu Can Improve the Operational Stability of the K Computer

Information systems are today an integral part of the wider social infrastructure, and they play an increasingly important role in almost every aspect of our lives. The stable and uninterrupted performance of these information systems is vital to our safety and security. Fujitsu positions such information systems as “social systems” and is involved in a variety of activities to improve the operational stability of the information systems used in public infrastructure.

But this requires an equivalent commitment from clients to work together with Fujitsu. Here we describe case studies where clients have achieved ongoing operational stability by applying themselves diligently to the operation of their information systems. We hope that this will serve as food for thought on the issue of operational stability.

What Does Operational Stability Mean for the K Computer?

The K computer, jointly developed by Fujitsu and the Advanced Institute for Computational Science (AICS) at RIKEN (see photo), is a massive supercomputer featuring some 80,000 computation nodes and 1.3 petabytes of memory. It is used to perform highly advanced simulations that are beyond the capacity of conventional supercomputers, and has applications in a wide range of areas including earth sciences, disaster modeling, medical care and manufacturing. The K computer is being shared among a number of research institutes and private-sector interests.

The K computer is a joint facility, and AICS is keen to serve as many users as possible by keeping the supercomputer running without interruption. Operational stability is therefore of paramount importance. Fujitsu is working closely with AICS on a number of strategies and approaches designed to meet this challenge.

K computer

Initiatives to Maintain K’s Uptime and Utilization Rates

Computer components such as disks and CPUs have known failure rates that are heavily influenced by their frequency of use and operating load. Being a very large-scale system, the K computer contains a very large number of these components, which means that many disks, CPUs and other parts have to be replaced every month.

AICS maintains an inventory of frequently replaced components to ensure that the required parts are readily available. AICS also uses failure forecasting to replace components before they fail, thus preventing system downtime.
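The reasoning behind such a spare-parts inventory can be illustrated with a short calculation. The following Python sketch is not the method used by AICS or Fujitsu; it simply assumes that failures of one component type are independent and can be modeled as a Poisson process. The function names, the 1.2% annual failure rate and the 99% service level are all hypothetical values chosen for illustration; only the figure of roughly 80,000 nodes comes from the description above.

```python
import math


def expected_monthly_failures(component_count: int, annual_failure_rate: float) -> float:
    """Average number of failures per month for one component type.

    annual_failure_rate is the fraction of units expected to fail per year
    (a hypothetical input here, not a published figure for the K computer).
    """
    return component_count * annual_failure_rate / 12.0


def spares_needed(mean_failures: float, service_level: float = 0.99) -> int:
    """Smallest stock s such that P(Poisson(mean_failures) <= s) >= service_level."""
    if not 0.0 < service_level < 1.0:
        raise ValueError("service_level must be strictly between 0 and 1")
    term = math.exp(-mean_failures)  # P(X = 0)
    cumulative = term
    stock = 0
    while cumulative < service_level:
        stock += 1
        term *= mean_failures / stock  # P(X = stock) from P(X = stock - 1)
        cumulative += term
    return stock


# Hypothetical example: one part per node, 1.2% of parts failing per year.
mean = expected_monthly_failures(80_000, 0.012)
print(f"expected failures per month: {mean:.0f}")
print(f"spares to cover 99% of months: {spares_needed(mean)}")
```

Under these assumptions, the stock level has to sit noticeably above the monthly average, because the number of failures fluctuates from month to month.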

Innovations have also been introduced in the environment in which the K computer operates, particularly in the air-conditioning system. A study of the correlation between air-conditioning temperature settings and component failure rates has found that blowing cool air to lower the room temperature can help to reduce such rates. In these and other ways, AICS and Fujitsu are working together to improve the operational stability of the K computer and boost the overall utilization rate.

Maximizing Accessibility to the K Computer

The primary goal of AICS and Fujitsu is to make the K computer joint facility available to as many users as possible. There is a long list of applicants keen to use the facility, so to maximize the number of research groups and private-sector organizations using K, it is necessary to minimize the number of computation nodes sitting idle at any one time. This requires careful scheduling of requests to achieve a balance between the large and small jobs submitted by users, and scheduling innovations are being implemented to eliminate idle computation nodes. Through a combination of improved scheduling and use of tools, AICS is able to boost the fill factor, that is, the percentage of operating hours actually spent on computation. AICS also publishes utilization rates, allowing prospective users to identify peak usage periods and quiet periods so they can plan their applications to minimize waiting time.
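The idea of balancing large and small jobs to keep nodes busy can be sketched in a few lines of Python. This is a simplified illustration, not the scheduler used on the K computer. The `Job` class, the function names, the 1,000-node machine and the job sizes are all hypothetical, and the fill factor is computed here in node-hours, which is one possible reading of the definition given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int    # number of computation nodes requested
    hours: float  # requested run time


def fill_idle_nodes(queue, free_nodes):
    """Walk the queue in order and start every job that still fits.

    Smaller jobs further back in the queue are allowed to fill nodes
    that a larger job ahead of them cannot use.
    """
    started = []
    for job in queue:
        if job.nodes <= free_nodes:
            started.append(job)
            free_nodes -= job.nodes
    return started, free_nodes


def fill_factor(jobs, total_nodes, window_hours):
    """Fraction of available node-hours actually spent on computation."""
    used = sum(job.nodes * job.hours for job in jobs)
    return used / (total_nodes * window_hours)


# Hypothetical queue on a hypothetical 1,000-node machine.
queue = [Job("large", 600, 10), Job("huge", 500, 10),
         Job("small", 300, 10), Job("tiny", 100, 10)]
started, idle = fill_idle_nodes(queue, free_nodes=1000)
print([job.name for job in started], idle)
print(fill_factor(started, total_nodes=1000, window_hours=10))
```

In this example, a strict first-come, first-served policy would stop at the 500-node job and leave 400 nodes idle, whereas letting the smaller jobs run fills the machine. A production scheduler would also have to reserve capacity so that large jobs are not postponed indefinitely, which this sketch does not attempt.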

The K computer project has combined the supercomputer operations know-how of AICS with the technical expertise of Fujitsu to achieve the overall goal of improving operational stability. Business continuity requires a two-pronged approach designed to minimize downtime while also enabling a rapid response to unexpected failures. Fujitsu is committed to maximizing operational stability for the information systems of our clients, and in turn helping them to create a more secure and stable society for all.