MTTF (Mean Time To Failure) is a reliability measure used to quantify the dependability of computer systems and components.
Definition and Usage
MTTF represents the average time a component or system is expected to operate before failing
It's used as a key reliability metric in computer system design and evaluation
MTTF is calculated as the inverse of the failure rate (this assumes the failure rate is constant over the component's life)
Example from the Materials
In the disk subsystem example shown:
Individual components had different MTTFs:
Disks: 1,000,000 hours
SCSI controller: 500,000 hours
Power supply: 200,000 hours
Fan: 200,000 hours
SCSI cable: 1,000,000 hours
Where does this value come from in the first place?
200,000 hours
The 200,000 hours value for the power supply's MTTF comes from empirical testing and reliability data collected by manufacturers. Here's how it typically works:
How manufacturers determine MTTF:
They test large numbers of power supplies
Run them until they fail
Record the failure times
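Calculate the average lifetime (in practice, usually total operating hours across all units divided by the number of failures observed, since few units are run all the way to failure)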
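As a rough sketch of that last step: the estimate is total operating hours divided by the number of failures, which is just the plain average when every unit is run to failure. The failure times below are hypothetical, chosen only to make the arithmetic easy to follow, and `estimate_mttf` is an illustrative name.

```python
def estimate_mttf(failure_times_hours):
    """Estimate MTTF as total operating hours divided by number of failures.

    Assumes every tested unit was run until it failed, so the result is
    simply the mean of the recorded lifetimes.
    """
    if not failure_times_hours:
        raise ValueError("need at least one recorded failure")
    return sum(failure_times_hours) / len(failure_times_hours)


# Hypothetical lifetimes (hours) for four tested power supplies.
sample_failure_times = [150_000, 180_000, 220_000, 250_000]
mttf_hours = estimate_mttf(sample_failure_times)
print(f"Estimated MTTF: {mttf_hours:,.0f} hours")
```

With these made-up lifetimes the estimate lands on 200,000 hours; real test campaigns use far more units.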
What does it mean: 1/200,000 in the context of the power supply?
This represents the power supply's failure rate, which is calculated by taking the inverse (1 divided by) its MTTF.
For the power supply:
MTTF = 200,000 hours
Failure rate = 1/200,000 per hour
This means that in any given hour of operation there is roughly a 1/200,000 (0.0005%) chance of failure
To make this more concrete:
If you had 200,000 power supplies running
You would expect, on average, 1 of them to fail within an hour
Or if you had 1 power supply running for 200,000 hours
You would expect it to fail once during that time period
This is why it's written as a rate (1/200,000) - it represents the probability of failure per unit of time (per hour in this case).
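The numbers above can be checked with a short sketch. It assumes the usual constant-failure-rate (exponential) model, which is an assumption about the component rather than something stated in the example; `prob_failure_within` is an illustrative helper name.

```python
import math

MTTF_HOURS = 200_000            # power supply MTTF from the example
failure_rate = 1 / MTTF_HOURS   # failures per hour

# Expected number of failures in a fleet of 200,000 supplies over one hour.
fleet_size = 200_000
expected_failures_per_hour = fleet_size * failure_rate


def prob_failure_within(hours, mttf_hours=MTTF_HOURS):
    """Probability that one unit fails within `hours`, assuming an
    exponential lifetime (constant failure rate)."""
    return 1 - math.exp(-hours / mttf_hours)


print(f"Failure rate: {failure_rate:.1e} per hour")
print(f"Expected fleet failures in one hour: {expected_failures_per_hour:.1f}")
print(f"P(fail within 1 hour): {prob_failure_within(1):.2e}")
print(f"P(fail within 200,000 hours): {prob_failure_within(MTTF_HOURS):.2f}")
```

Under this model a single supply has about a 63% chance of failing before it reaches its MTTF, so "fails once in 200,000 hours" is an average, not a guarantee.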
System MTTF Calculation
For a system with multiple components, the overall MTTF is calculated by:
First calculating the total failure rate (sum of individual component failure rates)
Then taking the inverse of the total failure rate
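In the example shown, the system MTTF was about 43,500 hours, or just under 5 years (a figure consistent with 10 disks plus one of each other component: 23/1,000,000 failures per hour)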
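The two steps can be sketched as follows. The MTTF values are the ones listed for the disk subsystem; the component counts (10 disks, one of everything else) are an assumption, chosen because they reproduce the roughly 43,500-hour result. The sketch also assumes failures are independent and that any single failure takes the system down.

```python
# (count, MTTF in hours) per component type. The MTTFs come from the example;
# the counts are an ASSUMPTION (10 disks, one of everything else).
components = {
    "disk": (10, 1_000_000),
    "SCSI controller": (1, 500_000),
    "power supply": (1, 200_000),
    "fan": (1, 200_000),
    "SCSI cable": (1, 1_000_000),
}


def system_mttf(parts):
    """Sum the per-component failure rates, then invert the total."""
    total_failure_rate = sum(count / mttf for count, mttf in parts.values())
    return 1 / total_failure_rate


mttf_hours = system_mttf(components)
print(f"System MTTF: {mttf_hours:,.0f} hours")
```

The result is far lower than any single component's MTTF because every added part adds its own failure rate to the total.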
Practical Application
MTTF helps in:
Evaluating system reliability
Planning maintenance schedules
Comparing different design alternatives
Making decisions about redundancy (like dual power supplies)
In other words, the enhanced program still takes 0.68 of its original run time, so it is only 1/0.68 ≈ 1.47× faster overall, even though part of it was made 5× faster!
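For reference, the 0.68 comes from the standard form of Amdahl's Law. The enhanced fraction is not stated at this point; 0.4 (40% of the run time) is inferred here as the value that yields 0.68 with a 5× enhancement, so treat it as an assumption:

```latex
\[
\text{Speedup}_{\text{overall}}
  = \frac{1}{(1 - \text{Fraction}_{\text{enh}}) + \dfrac{\text{Fraction}_{\text{enh}}}{\text{Speedup}_{\text{enh}}}}
  = \frac{1}{(1 - 0.4) + \dfrac{0.4}{5}}
  = \frac{1}{0.68} \approx 1.47
\]
```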
Why We Care:
It shows limitations of improvements
Helps make design decisions
Shows why bottlenecks matter
Explains why we can't just infinitely improve one part of a system
Helps allocate resources effectively
Real-world example:
If you upgrade your computer's CPU to be 2× faster
But only 30% of your tasks are CPU-bound
You'll only see about an 18% overall improvement (1 / (0.7 + 0.3/2) ≈ 1.18×)
Might make you reconsider if the upgrade is worth it!
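The CPU upgrade arithmetic can be checked with a small helper. This is a minimal sketch of Amdahl's Law; the function and parameter names are illustrative, not taken from the materials.

```python
def amdahl_speedup(fraction_enhanced, enhancement_factor):
    """Overall speedup when `fraction_enhanced` of the run time is made
    `enhancement_factor` times faster (Amdahl's Law)."""
    if not 0 <= fraction_enhanced <= 1:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if enhancement_factor <= 0:
        raise ValueError("enhancement_factor must be positive")
    remaining = 1 - fraction_enhanced                 # part left untouched
    improved = fraction_enhanced / enhancement_factor  # part after the upgrade
    return 1 / (remaining + improved)


# 30% of the work is CPU-bound and the new CPU is 2x faster.
cpu_upgrade = amdahl_speedup(fraction_enhanced=0.30, enhancement_factor=2)
print(f"Overall speedup: {cpu_upgrade:.3f}x")
print(f"Improvement: {(cpu_upgrade - 1) * 100:.1f}%")
```

Trying other values for `fraction_enhanced` shows how quickly the payoff shrinks when the upgraded part is only a small share of the work.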
In Amdahl's Law, the upper bound refers to the maximum possible speedup you can achieve, and it's limited by the portion of the system that CANNOT be improved.
Let me demonstrate:
Let's say we can improve a portion of our system (Fraction-enh) and make it infinitely fast (Enh-Factor → ∞):
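Written out in the standard notation, where the speedup of the enhanced portion (called Enh-Factor in these notes) is written as Speedup with subscript enh, the law and its limit look like this:

```latex
\[
\text{Speedup}_{\text{overall}}
  = \frac{1}{(1 - \text{Fraction}_{\text{enh}}) + \dfrac{\text{Fraction}_{\text{enh}}}{\text{Speedup}_{\text{enh}}}}
\]
\[
\lim_{\text{Speedup}_{\text{enh}} \to \infty} \text{Speedup}_{\text{overall}}
  = \frac{1}{1 - \text{Fraction}_{\text{enh}}}
\]
```

The second term in the denominator vanishes as the enhancement grows, leaving only the part that was never improved.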
Fraction-enh (Enhanced Fraction):
WHAT PORTION of the system can be improved
Example: 0.22 or 22% means 22% of the system can be enhanced
Enh-Factor (Enhancement Factor):
HOW MUCH you can improve that portion
The multiplier or speedup you can achieve
Example: Enh-Factor = 5 means that portion can be made 5 times better
This means:
If 90% can be improved: max speedup is 10×
If 50% can be improved: max speedup is 2×
If 20% can be improved: max speedup is 1.25×
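These three bounds follow from 1 / (1 − fraction that can be improved). A minimal sketch, with `max_speedup` as an illustrative name:

```python
def max_speedup(fraction_enhanced):
    """Upper bound on overall speedup as the enhancement factor grows
    without limit: 1 / (1 - fraction_enhanced)."""
    if not 0 <= fraction_enhanced < 1:
        raise ValueError("fraction_enhanced must be in [0, 1)")
    return 1 / (1 - fraction_enhanced)


for fraction in (0.90, 0.50, 0.20):
    print(f"{fraction:.0%} improvable -> at most {max_speedup(fraction):.2f}x")
```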
The key insight: You are always bounded by the part you cannot improve. This is why we say a chain is only as strong as its weakest link: in system improvement, you're limited by the parts you cannot enhance.
The Scenario:
Given Values:
Final Calculation:
Result:
This shows how adding redundancy (a second power supply) dramatically improves system reliability.
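One common textbook approximation for a redundant pair treats the pair as failing only if the second unit fails while the first is still being repaired. The sketch below uses that approximation. The 24-hour repair time (MTTR) is an assumption taken from the 24 in the unavailability figure, and the function name is illustrative.

```python
def redundant_pair_mttf(mttf_hours, mttr_hours):
    """Approximate MTTF of two redundant units with independent failures.

    Mean time until either unit fails is MTTF / 2; the pair is lost only if
    the survivor also fails during the repair window, which happens with
    probability of roughly MTTR / MTTF.
    """
    time_to_first_failure = mttf_hours / 2
    prob_second_fails_during_repair = mttr_hours / mttf_hours
    return time_to_first_failure / prob_second_fails_during_repair


single_supply_mttf = 200_000  # hours, from the example
pair_mttf = redundant_pair_mttf(single_supply_mttf, mttr_hours=24)  # 24 h is an ASSUMPTION
print(f"Pair MTTF: {pair_mttf:,.0f} hours "
      f"({pair_mttf / single_supply_mttf:,.0f}x a single supply)")
```

The approximation holds only when repair time is small compared with MTTF and the two supplies fail independently.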
Think of it like this:
Let's use real numbers from the example:
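Unavailability = 24/200,024 ≈ 0.00012 (0.012%)
This means the power supply is expected to be out of service about 0.012% of the time, taking 24 as the hours needed to repair or replace it and 200,000 as its MTTF.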
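A short sketch of that calculation, using the usual definitions availability = MTTF / (MTTF + MTTR) and unavailability = MTTR / (MTTF + MTTR). Reading the 24 as a repair time in hours is an assumption, as is the 365-day year used for the downtime estimate.

```python
MTTF_HOURS = 200_000  # power supply MTTF from the example
MTTR_HOURS = 24       # ASSUMPTION: the 24 in 24/200,024 is the repair time in hours

availability = MTTF_HOURS / (MTTF_HOURS + MTTR_HOURS)
unavailability = MTTR_HOURS / (MTTF_HOURS + MTTR_HOURS)

HOURS_PER_YEAR = 24 * 365
expected_downtime_per_year = unavailability * HOURS_PER_YEAR

print(f"Availability:   {availability:.5%}")
print(f"Unavailability: {unavailability:.5f} ({unavailability:.3%})")
print(f"Expected downtime: {expected_downtime_per_year:.2f} hours per year")
```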
It's like a car:
Amdahl's Law
This law is crucial because it helps us understand the limits of speeding up a system. For a practical example:
Imagine you're trying to speed up a computer program:
Using Amdahl's Law: