Many Ceph problems can be traced to its object store, RADOS. Used sensibly, the supplied tools reveal what is wrong.
Ceph has long since established itself as a scalable storage solution for dynamic environments. What began as a crazy idea in Sage Weil’s head now routinely competes in tenders against classic storage products, and regularly comes out on top. In most cases, the “single point of administration” tips the scales: at some point, every conventional storage system is full and cannot be expanded further, whereas a Ceph cluster scales almost without limit. At the same time, Ceph enjoys a reputation for being extremely stable: many administrators are almost surprised that a Ceph cluster, once installed, causes them no trouble for years. When something does go wrong, however, panic sets in.
But even Ceph admins with disaster experience regularly find themselves at a loss when faced with a broken cluster. One reason is that Ceph has changed significantly in some areas over the past few years. Another is that many commercially available training courses place little emphasis on debugging, especially since complex failure scenarios can hardly be recreated in the artificial environment of a course: after all, a multi-petabyte Ceph cluster is not available for every class.
The good news: the basic knowledge needed to deal confidently with errors in Ceph is less extensive than it appears. What matters is a handful of command-line tools and the ability to interpret their output correctly. Combined with some basic knowledge of RADOS, this is enough to identify and eliminate the causes of most problems quickly, or at least to work around them temporarily until a permanent solution can be implemented. This article covers those basics and serves as an aid for admins in an emergency: what to do when Ceph is on fire?