On my current project I will soon be delving into optimization at the level of inline assembly for PowerPC. These days the use of inline assembly is almost never justified. It's about as unportable as code can be, and it's nearly impossible to understand once it's written. Most of the time, unless you devote a great deal of energy to it or you are using processor features (SIMD, for example) that are inaccessible through C or C++, hand-written assembly will actually be slower than compiler-generated code. Furthermore, most of what you learned a few years ago about optimizing assembly code simply does not hold anymore. For example, what's the faster way to multiply an integer by 5 on the x86?
A. x = (x << 2) + x;
B. x *= 5;
Old-school assembly wisdom says that A is faster. Not true anymore. On current x86 processors the imul (integer multiply) instruction is fast enough that the shift-and-add trick no longer reliably wins. Counting cycles? Hard to do these days with deep pipelines, instruction reordering, branch prediction and unpredictable memory latency. The most effective way to optimize assembly seems to be aggressive profiling and trial and error. Gone are the days when you could optimize code by counting cycles with the processor manual's timing tables in hand. Even so, these guidelines are important:
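If you want to see what your own compiler does with the two forms, a minimal sketch along these lines may help. The function names are made up for illustration, and I've used unsigned 32-bit integers so that overflow wraps around instead of being undefined. Compile it with optimization on and look at the generated assembly (gcc -O2 -S, for example); you may well find the compiler emits the same code for both.

```c
#include <assert.h>
#include <stdint.h>

/* Option A: the "old school" shift-and-add form. */
uint32_t mul5_shift(uint32_t x)
{
    return (x << 2) + x;
}

/* Option B: a plain multiply; the compiler chooses the instruction(s). */
uint32_t mul5_mul(uint32_t x)
{
    return x * 5u;
}

/* Sanity check: both forms must agree, including when the result wraps. */
void mul5_self_check(void)
{
    for (uint32_t x = 0; x < 1000u; ++x) {
        assert(mul5_shift(x) == mul5_mul(x));
    }
    assert(mul5_shift(UINT32_MAX) == mul5_mul(UINT32_MAX));
}
```

The point isn't that one form wins. It's that the readable version (B) costs you nothing to try first, and the profiler gets the final say.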
1. Most importantly, make sure you have the most efficient algorithm possible for the job before moving to assembly! There are a million good reasons for this and nothing could be more embarrassing than having your finely tuned assembly bubble sort owned by a C (or Java!) mergesort written in 12 minutes.
2. Profile changes aggressively and at the finest resolution possible (usually the CPU cycle counters).
3. Space out memory accesses. Because memory accesses have latency and complete asynchronously, you can hide that latency by doing independent work between your memory reads and writes.
4. Know your memory access patterns and take advantage of them. Do you only write to certain areas of memory and never read them back? It may be beneficial to write straight through to memory and avoid caching those areas. In certain cases it can also be useful to prefetch memory.
5. Keep your data structures small enough to fit completely in cache. This will yield enormous benefits if you can do it.
6. Use SIMD where appropriate. This can give great benefit and itself may justify moving to inline assembly. However, don't spend an excessive number of cycles trying to fit data into SIMD-ready structures. It'll probably cost more than you'll get from it. Use SIMD when it's a good fit.
7. Unroll loops - to a point. Unroll tiny loops a step at a time, profiling as you go, until further unrolling no longer provides a performance benefit. If you go too far you'll see a significant performance drop as that piece of code outgrows the instruction cache. If you have enough information on the hardware, you can figure out where this threshold will be.
8. On PC, use SIMD for 64-bit integer arithmetic instead of the atrocious code that Visual C++ generates for it.
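To make guidelines 2 and 7 a bit more concrete, here is a rough sketch in plain C of the unroll-then-measure loop. Everything in it is my own illustration: the function names are hypothetical, the unroll factor of 4 and the separate accumulators are just one reasonable choice, and clock() is only a coarse, portable stand-in for the CPU cycle counters you would really want to read on your target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Baseline: one element per iteration. */
uint32_t sum_simple(const uint32_t *a, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}

/* Unrolled by 4 with independent accumulators, so the four adds in the
   loop body do not depend on one another. The second loop handles the
   0 to 3 leftover elements. */
uint32_t sum_unrolled4(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) {
        s0 += a[i];
    }
    return s0 + s1 + s2 + s3;
}

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Calls f `reps` times and returns the elapsed CPU time in seconds.
   The accumulated result is written to *out so the calls cannot be
   optimized away. Assumption: clock() stands in for a cycle counter. */
double time_sum(sum_fn f, const uint32_t *a, size_t n, int reps, uint32_t *out)
{
    assert(reps > 0);
    volatile uint32_t sink = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        sink += f(a, n);
    }
    clock_t t1 = clock();
    *out = sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Run time_sum on both versions over a buffer that's representative of your real data, then try unrolling by 8, 16 and so on. Whether the unrolled version wins at all depends on the compiler and the hardware (the compiler may already be unrolling the simple loop for you), which is exactly why you measure instead of guessing.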
Just so you know, this entry is subject to revision. Have any other guidelines? Let me know about 'em!