Physics of Failure - Shipulski On Design

Improving Product Robustness 101

December 24th, 2009

Improving product robustness is straightforward and difficult. Here’s how to do it.

Identify specific failure modes, prioritize them, and go after the biggest ones first. Failure modes can be identified through multiple sources. Warranty data is sometimes coded by failure mode (more precisely, symptom type), so start there. The number one failure mode in this type of data is typically “no problem found”, so be ready for it. Analysis of the actual products that come back is another good way. Returned product is routed to the appropriate engineer who analyzes it and enters the failure mode into a database. A formal design FMEA generates a list of prioritized failure modes through the risk priority number (RPN), where larger is more important. To do this, engineers are hauled into a room and a facilitator helps them come up with potential failure modes. One caution – the process can generate many failure modes, more than you can fix, so make the top five or ten go away and don’t argue the bottom fifty. It makes no sense to even talk about number eleven if you haven’t fixed the top ten. But the best way I have found to identify failure modes (problems) that are meaningful to the customer is to ask the technical services group for their top five things to fix. They will give you the right answer because they interact daily with customers who have broken product. They won’t expect you to listen to them (you never listened before), so surprise them by fixing one or two on their list. They will be grateful you listened (they’ll likely want to buy you coffee for the rest of your career) and your customers will notice.

Once failure modes are identified, define the physics of failure – why the product breaks. This is tough work and requires focused thought and analysis. If, when you break the product, it “looks like” the ones coming back from the field, you have defined the physics of failure. This is the same thing as replicating the problem in the lab. Once that’s defined, create an automated test rig or experimental setup that breaks the product in a way that captures the physics of failure. I call this test rig a robustness surrogate because it stands in for the actual failure mode seen in the field. The robustness surrogate should break the product as fast as possible while retaining the physics of failure so you can break it and fix it many times before product launch. The robustness surrogate should be designed to break the product within minutes, not hours or days – the faster the better.

To know if product robustness is improved, the baseline (or existing) design is broken on the robustness surrogate. The new design must survive longer on the robustness surrogate than the baseline design. The result is A/B data (baseline design/ new design) that is presented at the design review using a simple bar graph format which I call big-bar-little-bar. Keep improving robustness of the new design even if it outperforms the baseline design by a factor of ten – that’s not good enough for your customers.

Don’t stop improving robustness until you run out of time, and don’t stop if you meet the arbitrary MTBF specification. Customers like improved robustness, and in this case too much of a good thing is wonderful.

Using this method, I reduced warranty cost per unit by 75% over a five year period. It worked.