User:CTho/Overclocking
Overclocking
Device performance
Signal integrity
Failure mechanisms
EM
Electromigration is a well-known failure mechanism for CPUs. As electrons flow through a wire, they occasionally bump into metal atoms and move them. As atoms are moved away from a point, the wire there gets narrower until it breaks. One might think that when a wire has thinned a little but not failed completely, the chip would still work at lower clock speeds, but I would expect that once this point is reached the wire rapidly goes on to fail completely. The rate of electromigration is largely determined by the current density, and as a wire thins, the current density goes up, making electromigration happen even faster at the thin point.
The atoms that get knocked out of place have to end up somewhere. This brings us to a second way electromigration can kill a chip: by creating short circuits. As metal atoms pile up somewhere, they can form a bridge to a nearby wire. Page 725 of http://www.crystalresearch.com/crt/ab35/721_a.pdf has a good picture showing electromigration effects over time (actually, many of the pictures are interesting, and the paper is not a difficult read).
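The feedback between thinning and current density can be illustrated with a toy model. This is only a sketch under loud assumptions: the current through the wire is held constant, the cross-section shrinks at a rate proportional to current density raised to a hypothetical exponent, and the rate constant and time units are arbitrary. The function name `thinning_timeline` and all of its parameters are made up for illustration.

```python
def thinning_timeline(exponent=2.0, dt=1e-5, fail_fraction=0.05):
    """Toy model of a wire thinning at constant current (not real device physics).

    `area` is the remaining cross-section as a fraction of the original.
    Current density scales as 1/area, and the assumed thinning rate scales
    as current density ** exponent. Returns (time to reach half the
    original area, time to reach `fail_fraction`) in arbitrary units.
    """
    area = 1.0
    t = 0.0
    t_half = None
    while area > fail_fraction:
        area -= dt * area ** (-exponent)  # thinner wire -> faster thinning
        t += dt
        if t_half is None and area <= 0.5:
            t_half = t
    return t_half, t


t_half, t_fail = thinning_timeline()
print(f"fraction of lifetime spent losing the first half: {t_half / t_fail:.2f}")
```

In this toy model most of the wire's life is spent losing the first half of its cross-section, and the rest goes comparatively quickly, which is consistent with the expectation that a noticeably thinned wire does not stay "marginal" for long. The exact split depends entirely on the assumed exponent.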
The rate of electromigration depends exponentially on temperature. It also increases as the voltage is raised, because the current (and therefore the current density) increases. The electrons may also be more energetic, but I need to look this up.
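The temperature and current-density dependence is commonly summarized by Black's equation, in which the mean time to failure is proportional to J^(-n) * exp(Ea / (k*T)). The sketch below computes lifetime relative to a reference operating point. The activation energy, the current exponent, and the 60 °C reference temperature are illustrative placeholders, not values for any real process, and the function name is hypothetical.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_em_lifetime(current_density_ratio, temp_c, ref_temp_c=60.0,
                         activation_energy_ev=0.8, current_exponent=2.0):
    """Electromigration lifetime relative to a reference point (Black's equation).

    current_density_ratio: current density divided by the reference value.
    temp_c / ref_temp_c: wire temperature and reference temperature in Celsius.
    activation_energy_ev and current_exponent are assumed, illustrative values.
    A result of 0.5 means the wire is expected to last half as long.
    """
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    thermal = math.exp(activation_energy_ev / K_BOLTZMANN_EV * (1.0 / t - 1.0 / t_ref))
    current = current_density_ratio ** (-current_exponent)
    return thermal * current


# Hypothetical overclock: 10% more current density, 20 degrees hotter.
print(relative_em_lifetime(1.10, 80.0))
```

With these assumed parameters the temperature term dominates: a modest rise in temperature costs far more lifetime than a modest rise in current density.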
Margins
Hot-E
NBTI
Heat & Voltage
Unorganized notes
Why I don't overclock any more
- data corruption
- when a problem occurs, finding the cause becomes impossible (or at least a pain in the butt)
- negligible real performance gain without spending big money
What is silent data corruption?
Basically, when your CPU makes a mistake, there is some chance you notice (for example, because a program crashes or a game shows visual artifacts), but there is also a chance that you don't. Different kinds of CPU errors have different results.
One error that would not cause any problems is a mistake in a branch prediction. Since the CPU knows that the branch predictor is often wrong, it checks the predictions and recovers if the predictions are wrong. Since the branch predictor doesn't affect correctness, the worst case result would be a tiny slowdown. Another error that wouldn't matter is an error that corrupts information that isn't reused. Every time the CPU adds two numbers, it tracks some "flags" that note if the result is positive, negative, or zero. If a flag is miscalculated - but isn't checked - nothing bad happens.
An error that would cause a crash might be a mistake in a TLB (translation lookaside buffer), or a mistake reading or decoding an instruction (which could result in the CPU trying to do something illegal). Depending on what code happens to be running when this happens, the result could be a single application crashing or the whole machine crashing.
An error that could silently corrupt data would be a bit flipping in a floating point calculation. If, say, you're doing your taxes and your CPU is overclocked, it's possible that one of the bits of a floating point calculation doesn't get back into the register file before the clock ticks, so the wrong answer is stored. You might not notice so long as the error doesn't cause an obvious discrepancy... that's silent data corruption.
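To get a feel for why a single wrong bit can go unnoticed, here is a small sketch that inverts one bit of a number stored as an IEEE 754 double. It does not model how an overclocked CPU actually fails; it only shows how much (or how little) one flipped bit can change a stored value. The helper `flip_bit` and the sample amount are made up for illustration.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its 64-bit double encoding inverted.

    bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


amount = 52340.75  # hypothetical figure from a tax calculation
for bit in (0, 30, 51, 62):
    print(bit, flip_bit(amount, bit))
```

A flip in a low mantissa bit changes the value by far less than a cent and would never be noticed, a flip in a high mantissa bit changes it by thousands, and a flip in the exponent produces an absurd result that is at least easy to spot. The errors in between are the ones that make silent corruption worrying.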