Mean Time To Failure | MTTF

MTTF (Mean Time To Failure) is a reliability measure used to quantify the dependability of computer systems and components.

Definition and Usage

MTTF represents the average time a component or system is expected to operate before failing
It's used as a key reliability metric in computer system design and evaluation
MTTF is calculated as the inverse of the failure rate

Example from the Materials

In the disk subsystem example shown, the individual components had different MTTFs:
Disks: 1,000,000 hours
SCSI controller: 500,000 hours
Power supply: 200,000 hours
Fan: 200,000 hours
SCSI cable: 1,000,000 hours

Where does the 200,000 hours value come from?

The 200,000 hours value for the power supply's MTTF comes from empirical testing and reliability data collected by manufacturers. Here's how it typically works:

How manufacturers determine MTTF:

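They test large numbers of power supplies
Run them for a fixed period, often under accelerated stress (running every unit to failure would take decades at this MTTF)
Record how many units fail and the total operating hours accumulated
Estimate MTTF as total operating hours divided by the number of failures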
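
As a rough sketch of that last step, assuming a test where units run for a fixed period rather than all the way to failure: the usual estimate divides the total accumulated operating hours by the number of failures observed. The function name and the test figures below are hypothetical, chosen only to illustrate the arithmetic; they are not real manufacturer data.

```python
def estimate_mttf(units_tested: int, hours_per_unit: float, failures: int) -> float:
    """Estimate MTTF as total operating hours divided by observed failures.

    Simplification: assumes every unit accumulates the full test duration,
    ignoring the hours lost after a unit fails.
    """
    if failures <= 0:
        raise ValueError("need at least one observed failure to estimate MTTF")
    return units_tested * hours_per_unit / failures


# Hypothetical test: 1,000 units run for 1,000 hours each, 5 failures observed.
print(estimate_mttf(1000, 1000, 5))  # 200000.0
```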

What does it mean: 1/200,000 in the context of the power supply?

This represents the power supply's failure rate, which is calculated by taking the inverse (1 divided by) its MTTF.

For the power supply:

MTTF = 200,000 hours
Failure rate = 1/200,000 per hour
This means that in any given hour there is roughly a 1 in 200,000 chance of failure (assuming the failure rate stays constant over the component's life)

To make this more concrete:

If you had 200,000 power supplies running
You would expect, on average, 1 of them to fail within an hour
Or if you had 1 power supply running for 200,000 hours
You would expect it to fail once during that time period

This is why it's written as a rate (1/200,000) - it represents the probability of failure per unit of time (per hour in this case).
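
The same numbers can be checked with a small sketch. It assumes a constant failure rate (exponentially distributed lifetimes), which is the usual simplification behind "failure rate = 1/MTTF"; the function names are made up for illustration.

```python
import math

MTTF_HOURS = 200_000            # power supply MTTF from the example
failure_rate = 1 / MTTF_HOURS   # failures per hour


def expected_failures(units: int, hours: float, rate: float = failure_rate) -> float:
    """Expected number of failures across `units` devices running for `hours`."""
    return units * hours * rate


def prob_fail_within(hours: float, rate: float = failure_rate) -> float:
    """Probability that a single unit fails within `hours` (exponential model)."""
    return -math.expm1(-rate * hours)  # same as 1 - exp(-rate * hours)


print(failure_rate)                               # 5e-06
print(round(expected_failures(200_000, 1), 6))    # 1.0
print(round(prob_fail_within(200_000), 3))        # 0.632
```

Note the last line: under this model a single unit has about a 63% chance of failing before it reaches its MTTF, so "one failure per 200,000 hours" is an average, not a guarantee.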

System MTTF Calculation

For a system with multiple components, where any single component failure takes the system down and failures are independent, the overall MTTF is calculated by:
First calculating the total failure rate (sum of individual component failure rates)
Then taking the inverse of the total failure rate
In the example shown, the system MTTF was about 43,500 hours (roughly 5 years)
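
Here is a minimal sketch of that two-step calculation. The component counts are an assumption: the notes list the MTTFs but not how many of each part the subsystem has. Ten disks plus one of every other part reproduces the quoted figure, so that is what is used here.

```python
# (count, MTTF in hours) per component type.
# Counts are ASSUMED for illustration; only the MTTFs come from the notes.
components = {
    "disk": (10, 1_000_000),
    "SCSI controller": (1, 500_000),
    "power supply": (1, 200_000),
    "fan": (1, 200_000),
    "SCSI cable": (1, 1_000_000),
}


def system_mttf(parts: dict) -> float:
    """Sum the per-component failure rates, then invert the total."""
    total_rate = sum(count / mttf for count, mttf in parts.values())
    return 1 / total_rate


print(round(system_mttf(components)))  # 43478, i.e. about 43,500 hours
```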

Practical Application

MTTF helps in:
Evaluating system reliability
Planning maintenance schedules
Comparing different design alternatives
Making decisions about redundancy (like dual power supplies)

The Scenario:

You have 2 power supplies
Only 1 needs to work for the system to function
It's a redundant system

The Calculation Process:

Failure rate (power supply pair)
= C(2,1) × failure rate (one power supply) × unavailability
= 2 × (1/MTTF) × [MTTR/(MTTF + MTTR)]
≈ (2/MTTF) × (MTTR/MTTF)      (since MTTR is much smaller than MTTF, MTTF + MTTR ≈ MTTF)
= (2 × MTTR)/MTTF²

Given Values:

MTTF (Mean Time To Failure) = 200,000 hours
MTTR (Mean Time To Repair) = 24 hours

Final Calculation:

MTTF (power supply pair) = 1 / failure rate (power supply pair)
                         = MTTF²/(2 × MTTR)
                         = 200,000²/(2 × 24)
                         = 40,000,000,000/48
                         ≈ 830,000,000 hours

Result:

The dual power supply system is about 4,150 times more reliable than a single power supply
Because: 830,000,000/200,000 ≈ 4,150

This shows how adding redundancy (a second power supply) dramatically improves system reliability.
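
The calculation can be reproduced with a short sketch, which also shows how little the MTTF + MTTR ≈ MTTF approximation changes the answer. The function name and the `approximate` flag are made up for illustration.

```python
def pair_mttf(mttf: float, mttr: float, approximate: bool = True) -> float:
    """MTTF of a redundant pair where only one unit must work.

    The pair fails when one unit fails (rate 2/MTTF, since either can go first)
    while the other is unavailable.
    """
    unavailability = mttr / mttf if approximate else mttr / (mttf + mttr)
    pair_failure_rate = 2 * (1 / mttf) * unavailability
    return 1 / pair_failure_rate


approx = pair_mttf(200_000, 24)
exact = pair_mttf(200_000, 24, approximate=False)

print(f"{approx:,.0f}")          # 833,333,333
print(f"{exact:,.0f}")           # 833,433,333
print(round(approx / 200_000))   # 4167
```

The unrounded ratio is about 4,167; the 4,150 figure above comes from rounding the pair MTTF to 830,000,000 hours first.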

The formula for unavailability: [MTTR/(MTTF+MTTR)]

This formula represents the fraction of time a system is NOT working.

Think of it like this:

MTTF = Mean Time To Failure (how long until it breaks)
MTTR = Mean Time To Repair (how long to fix it)
MTTF + MTTR = Total cycle time (time working + time being repaired)

Let's use real numbers from the example:

MTTF = 200,000 hours (time running)
MTTR = 24 hours (time being fixed)
Total cycle = 200,000 + 24 = 200,024 hours

Unavailability = 24/200,024 ≈ 0.00012 (0.012%)

This means:

System is down 0.012% of the time
System is available 99.988% of the time

It's like a car:

If your car runs for 1,000 hours
Takes 1 hour to repair
Unavailability = 1/(1000+1) ≈ 0.001
It's unavailable 0.1% of the time
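
Both examples can be checked with a small sketch of the formula. The function names are illustrative, and the downtime-per-year line is an extra conversion that assumes a 365-day year.

```python
def unavailability(mttf: float, mttr: float) -> float:
    """Fraction of each failure/repair cycle spent being repaired."""
    return mttr / (mttf + mttr)


def availability(mttf: float, mttr: float) -> float:
    """Fraction of each failure/repair cycle spent working."""
    return mttf / (mttf + mttr)


HOURS_PER_YEAR = 24 * 365  # 8,760

u = unavailability(200_000, 24)
print(f"{u:.6f}")                                  # 0.000120
print(f"{availability(200_000, 24):.3%}")          # 99.988%
print(f"{u * HOURS_PER_YEAR:.2f} hours down per year")  # 1.05 hours down per year

# The car analogy: 1,000 hours running, 1 hour to repair
print(f"{unavailability(1000, 1):.2%}")            # 0.10%
```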

Amdahl's Law

timenew = timeold × [(1 - FractionEnh) + (FractionEnh / Speedup)]

This law is crucial because it helps us understand the limits of speeding up a system. A practical example:

Imagine you're trying to speed up a computer program:

40% of the program can be improved (FractionEnh = 0.4)
60% cannot be improved (1 - FractionEnh = 0.6)
You make the improvable part 5× faster (Speedup = 5)

Using Amdahl's Law:

timenew = timeold × [0.6 + (0.4/5)]
timenew = timeold × [0.6 + 0.08]
timenew = timeold × 0.68

This means you can only make the program 1/0.68 = 1.47× faster overall, even though you made part of it 5× faster!
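
The worked example above can be wrapped in a small function. This is just a sketch of the formula; the function and parameter names are chosen for illustration.

```python
def amdahl_speedup(fraction_enh: float, speedup: float) -> float:
    """Overall speedup when `fraction_enh` of the run time is made `speedup` times faster."""
    if not 0 <= fraction_enh <= 1:
        raise ValueError("fraction_enh must be between 0 and 1")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # New time relative to the old time (timenew / timeold)
    relative_time = (1 - fraction_enh) + fraction_enh / speedup
    return 1 / relative_time


print(round(amdahl_speedup(0.4, 5), 2))  # 1.47
```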

Why We Care:

It shows limitations of improvements
Helps make design decisions
Shows why bottlenecks matter
Explains why we can't just infinitely improve one part of a system
Helps allocate resources effectively

Real-world example:

If you upgrade your computer's CPU to be 2× faster
But only 30% of your tasks are CPU-bound
You'll only see about an 18% overall improvement (1/(0.7 + 0.3/2) = 1/0.85 ≈ 1.18×)
Might make you reconsider if the upgrade is worth it!

In Amdahl's Law, the upper bound refers to the maximum possible speedup you can achieve, and it's limited by the portion of the system that CANNOT be improved.

Let me demonstrate:

Let's say we can improve a portion of our system (Fractionenh) and make it infinitely fast (Speedup → ∞):

Using the formula:

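timenew = timeold × [(1 - Fractionenh) + (Fractionenh / Speedup)]

As Speedup approaches infinity:
timenew = timeold × [(1 - Fractionenh) + (Fractionenh / ∞)]
timenew = timeold × [(1 - Fractionenh) + 0]
timenew = timeold × (1 - Fractionenh)

So the maximum overall speedup is timeold/timenew = 1/(1 - Fractionenh).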

Example:

If 90% can be improved (Fractionenh = 0.9)
Even with infinite speedup of that 90%
You still have 10% that can't be improved
Maximum speedup = 1/0.1 = 10×

Fractionenh (Enhanced Fraction):
The PORTION of the system that can be improved
Expressed as a decimal or percentage
Example: 0.22 or 22% means 22% of the system can be enhanced

Enh-Factor (Enhancement Factor, the Speedup term in the formula):
HOW MUCH you can improve that portion
The multiplier or speedup you can achieve
Example: Enh-Factor = 5 means that portion can be made 5 times faster

Applying the upper bound 1/(1 - Fractionenh):

If 90% can be improved: max speedup is 10×
If 50% can be improved: max speedup is 2×
If 20% can be improved: max speedup is 1.25×
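
A short sketch makes the bound concrete: it reproduces the three limits above and then shows how the overall speedup creeps toward the limit as the enhanced part gets faster. The function names are illustrative.

```python
def max_speedup(fraction_enh: float) -> float:
    """Upper bound on overall speedup as Speedup -> infinity."""
    if not 0 <= fraction_enh < 1:
        raise ValueError("fraction_enh must be in [0, 1)")
    return 1 / (1 - fraction_enh)


def overall_speedup(fraction_enh: float, speedup: float) -> float:
    """Amdahl's Law for a finite enhancement factor."""
    return 1 / ((1 - fraction_enh) + fraction_enh / speedup)


for f in (0.9, 0.5, 0.2):
    print(f, round(max_speedup(f), 2))
# 0.9 10.0
# 0.5 2.0
# 0.2 1.25

# Diminishing returns when 90% of the system is enhanced:
for s in (10, 100, 1000):
    print(s, round(overall_speedup(0.9, s), 2))
# 10 5.26
# 100 9.17
# 1000 9.91
```

Even a 1,000× enhancement of the 90% only reaches 9.91× overall, just short of the 10× ceiling.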

The key insight: You are always bounded by the part you cannot improve. Just as a chain is only as strong as its weakest link, a system's overall improvement is limited by the parts you cannot enhance.