Sunday, May 07, 2006


Your product has little or no value if its functionality isn't available to users. Bugs, crashes, and network outages are examples of what might make your product's functionality unavailable at times.

Product managers therefore typically attach an availability constraint (nonfunctional requirement) to each functional requirement of the product. If one of the functions of the product is to generate reports, for example, a product manager should specify how likely it should be at any particular time that a user will be able to use this functionality.

The question with nonfunctional requirements is always the metric - how you measure them. How do you measure availability? Here are some options:
  • mean-time-between-failures (MTBF) - the average amount of time elapsed between failures to deliver the functionality.
  • failure rate - how frequently the product fails to deliver the functionality. Failure rate is the reciprocal of MTBF and is often expressed in failures per hour.
  • uptime - percentage of the time that the functionality is available.
  • downtime - percentage of the time that the functionality is not available.
  • mean-time-between-system-abort (MTBSA) - the average amount of time elapsed between complete "reboots" of the system.
  • mean-time-between-critical-failure (MTBCF) - distinguishes between critical and noncritical failures.
  • maintenance free operating period (MFOP) - the average amount of time that the functionality is available without any special intervention or system maintenance.
Of course, a prospective customer will always want 100% uptime, but such availability is typically not practical to achieve. If you base a contract on 100% uptime, you will almost certainly be in violation of your contract at some point.
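To get a feel for what an uptime target below 100% actually permits, it can help to turn the percentage into a downtime budget. The following is only a rough sketch: the function name, the example targets, and the 365-day year are assumptions chosen for illustration, not figures from any particular contract.

```python
# Illustrative sketch: convert an uptime percentage into a downtime budget.
# The names and example targets are hypothetical.

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted in a period at a given uptime."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 100% target leaves a budget of zero hours, which is why any single outage would put you in violation.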

UPDATE: Scott Sehlhorst adds a number of important observations in this entry's comments. One thing he notes is that I neglected to mention MTTR:
  • mean-time-to-repair (MTTR) - the average amount of time it takes to repair the system after its functionality becomes unavailable. For hardware products, it usually refers to the time to replace a module or part. For software products, it can refer to the amount of time it takes to reboot or restart the system.
Also, some people use "availability" to refer strictly to uptime, and consider all of these parameters to be "reliability" metrics.


Scott Sehlhorst said...

Hey Roger,

I don't think you can realistically use MTBF for measuring software product availability. I notice you didn't suggest that you could, just pointing that out in case anyone assumed it (like I did initially).

Availability requires more than MTBF to measure (my background was mechE prior to software): you also have to include MTTR (mean time to reset the system).

Availability = MTBF/(MTBF + MTTR).
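As a rough illustration of the formula above, here is a minimal sketch. The function name and the sample figures (500 hours between failures, half an hour to reset) are made up for the example and are not measurements from a real system.

```python
# Illustrative sketch of Availability = MTBF / (MTBF + MTTR).
# The sample numbers below are hypothetical.


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return the fraction of time the system is available (0.0 to 1.0)."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mttr_hours < 0:
        raise ValueError("mttr_hours must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Assumed values: a failure every 500 hours, 30 minutes to reset.
    fraction = availability(mtbf_hours=500.0, mttr_hours=0.5)
    print(f"Availability: {fraction:.4%}")
```

Both inputs must be in the same unit (hours here). Note that shortening MTTR raises availability even when MTBF stays the same.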

MTBF-based non-functional requirements should also provide additional information in order to be unambiguously tested.

When I was doing hardware design, we would establish an MTBF measured in operations (say 100,000). We would either present that data, or look at typical usage patterns (10 operations per hour) and express MTBF in terms of hours - 10,000. We never converted MTBF to availability.
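The conversion described above is simple division. Here is a small sketch using the same example numbers (100,000 operations, 10 operations per hour); the function names are hypothetical, and the failure-rate helper just applies the reciprocal relationship mentioned in the post.

```python
# Illustrative sketch: convert an operations-based MTBF into hours,
# then take the reciprocal to get a failure rate. Names are hypothetical.


def mtbf_operations_to_hours(mtbf_operations: float,
                             operations_per_hour: float) -> float:
    """Express an MTBF measured in operations as an MTBF in hours."""
    if operations_per_hour <= 0:
        raise ValueError("operations_per_hour must be positive")
    return mtbf_operations / operations_per_hour


def failure_rate_per_hour(mtbf_hours: float) -> float:
    """Failure rate is the reciprocal of MTBF (failures per hour)."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 / mtbf_hours


if __name__ == "__main__":
    hours = mtbf_operations_to_hours(100_000, 10)
    print(f"MTBF: {hours:.0f} hours")
    print(f"Failure rate: {failure_rate_per_hour(hours)} failures per hour")
```

The result depends entirely on the assumed usage pattern, which is one reason an MTBF requirement needs that extra information to be testable.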

For software, we don't have creep-based or other cyclic failure mechanisms. I'm not sure what the true distribution is (I would guess random), but it definitely isn't a Weibull distribution, the one most commonly associated with hardware failures that are a function of repetition.

Since software failures seem to correlate more strongly with circumstances than with repetition, I would suggest that an MTBF-based availability calculation is both unmeasurable and ill-advised.

Uptime/downtime and MFOP (as you describe) are more valuable approaches for software.


Roger L. Cauvin said...

Sounds like you have some good experience with the various metrics, Scott. I did leave out MTTR, and I am going to update the entry to include it.

I also do think it's important to provide context to these metrics. You can use a product in many different circumstances; the availability requirements should specify the possible circumstances to the extent practicable.

As I've stated before, just finding - or even exploring - the right combination of metrics is arguably more important than assigning all of the exact numbers and exhaustive circumstances. (When there's a contract involved, these other factors are obviously still very important.)

Paul Young said...

Another important use of MTTR is in services. All the major telcos use MTTR, and at Cisco services, where I worked, we used MTTR heavily as a metric for our managed service.

That brings up another topic: there is a serious lack of resources out there for the "Product Manager" of a service. It wouldn't help me as much now that I've moved to the product side, but there are a lot of PMs I know who would benefit from it. Maybe we can collaborate on this?

Paul Young