Other issues in HPC development
From HPCBugBase
The primary goal of HPCBugBase is to accumulate knowledge on software defects. However, our study shows that writing and debugging code is far from the only thing HPC developers have to deal with. In fact, an enormous amount of effort seems to be spent on "side work", such as
- Obtaining an account and resource allocation on an HPC machine
- Setting up a machine environment: knowing how to log in, how to transfer files, which compiler and compile options to use, how to run a job, etc.
- Waiting in a job queue
- Coping with various system problems, e.g., machine downtime
List of common non-programming issues
- Account/login
- Paperwork associated with an account application
- Some HPC machines require Kerberos, while others require a secure ID for login. Special client software or hardware may have to be set up just to log in to such machines.
- Setup
- Many HPC machines require the account to be set up so that the user can log in between nodes without entering a password. This usually means configuring ssh keys or .rhosts files.
- Knowing the right compile options for a particular machine.
- Some language features can be unavailable, broken, or too slow on a particular machine, depending on the system architecture (shared/distributed memory, memory hierarchy and bandwidth) and the language implementation (e.g., some machines have a native UPC compiler, while others just convert the UPC code to MPI).
- Resource allocation
- Long waits in a job queue seem to be one of the biggest headaches for users of HPC machines.
- Users tend to need interactive jobs with relatively small allocations during debugging, while they need a large allocation for real execution.
- Job scheduling algorithm
- System down
- Hardware problems are very common. One reason might be that today's HPC systems are massively parallel, so there are many points of failure.
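
The remark about failure points can be illustrated with a little arithmetic. The sketch below is not taken from the HPCBugBase study. It assumes, purely for illustration, that each node fails independently with the same probability during a job, and the function name and the 0.1% figure are hypothetical. Under those assumptions, the chance that at least one node fails is 1 - (1 - p)^N for N nodes, which climbs quickly as N grows.

```python
def prob_any_failure(node_count: int, per_node_failure_prob: float) -> float:
    """Chance that at least one of `node_count` nodes fails during a job.

    Assumes independent, identically likely node failures, which is a
    simplification: real failures are often correlated.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    if not 0.0 <= per_node_failure_prob <= 1.0:
        raise ValueError("per_node_failure_prob must be in [0, 1]")
    # P(no node fails) = (1 - p)^N, so P(at least one fails) is its complement.
    return 1.0 - (1.0 - per_node_failure_prob) ** node_count


if __name__ == "__main__":
    # Hypothetical per-node failure probability of 0.1% per job.
    for nodes in (1, 100, 10_000):
        print(nodes, prob_any_failure(nodes, 0.001))
```

Even with a small per-node probability, the result approaches 1 once the node count reaches the thousands, which is consistent with the observation that large machines see hardware problems often.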
