Author Archive for John Worley

The Parallel Universe: Mining Parallelism Pt. VII

Gems in the Dust. As if we hadn’t found enough parallelism already (is there ever really enough?), we shouldn’t overlook the mundane, but ubiquitous instructions that handle bookkeeping, intermediate results, and program state manipulations. Even subtler is the practice of computing results that may not be needed if a certain condition is true,
but that can be computed in parallel with evaluating the condition. It’s easy to miss these gems in the dust of computing because we don’t often think of them on algorithmic level. But when the code is executed, they are there, they are necessary, and very often they can be easily moved around to increase the overall instructions executed each cycle.

When I was a graduate student, frustrated with the lack of progress on my thesis, I once challenged my adviser as to how pervasive parallelism was in computing and the world in general. He told me that parallelism is everywhere; you just have to look for it, grab hold of it, and use it. Actually, in the way of all truly great teachers, he simply said “that’s not going to make much of a thesis,” and let me work out the rest. So when someone tells me that there’s no parallelism to be found in one or another computing problems, I simply smile and let them see for themselves the endless expanses of the Parallel Universe.

The Parallel Universe: Mining Parallelism Pt. VI

Pipelines for Fun and Performance. We talked about hardware pipelines and the wonderful things they do for performance, when we treat them well, of course. Since software cannot decompose computations below the instruction level, pipelines in software enhance performance
by overlapping useful computation with operational latencies. Latencies arise from memory delays and from complex operations that don’t complete in a single cycle. In Itanium, the multimedia instructions, floating-point operations, and moving data to and from the floating point unit, all have multiple cycle latency. Executed serially, latencies spell poor performace; overlapped with multiple stages of execution, the latencies are effectively buried and allow the computation to produce results orders of magnitude faster.

Software pipelines are, of course, possible, but clumsy and not particularly effective on most architectures (Cray and it’s descendents excepted, of course). Itanium was designed with software pipelines in mind: large integer and floating registers, predicates to dynamically control execution, and parallel units so that multiple pipeline stages can execute in parallel.

But there’s one little-known feature that makes software pipelines the performance and parallelism champion: register renaming. The effect of this nifty feature is to move results from certain integer, floating point, and predicate registers to the next higher numbered register without actually moving the data, so that the data in the pipeline naturally moves from one stage to the next using no additional instructions or temporary registers. Using these architectural features, it is possible, and down right fun, to build software pipelines that can deliver results at amazing rates: the HP Vector Math Library is a shining example of what can be accomplished, but the possibilities are only limited by the programmers’ imagination and motivation.

In the last post in this vein, I’ll dig into an often overlooked, but extremely rewarding, source of parallelism.

The Parallel Universe: Mining Parallelism Pt. V

Shuffle, Fold, and Flatten. In the first series of postings, I put forward the idea that Itanium compels us to rethink programs and programming. Part of this is what I call the two dimensional approach: considering a program as a structure in time and execution space. Until the day that someone develops a better tool, the best way to visualize this is with a spreadsheet program, such as the one available in OpenOffice. Once a programmer can see the available execution space, it’s almost impossible not to want to use it. When dealing with instructions with multiple cycle latency, such as floating point operations, one can almost envision another “half” dimension of in-flight results.

When coding loops for Itanium, the two (plus) dimensional thinking often requires reorganizing the computation to best fit the architecture. If each trip through the loop is independent, unrolled loops can be executed in parallel; if not, the loop can be unrolled so that the next loop trip begins executing as soon as it’s arguments are available (data flow, anyone?). Other loops can be restructured to execute faster by doing more operations each time, so-called height reduction, because the extra operations can be done in parallel.

Another common computational structure that proves a source of parallelism is the rhomboidal “shape” of many loops: the loop starts out with a few base computations, which then lead to more and more computations in the middle of the loop; as the loop completes, the number of computations decreases. Using predication to disable the tail the first time through, the tail can be run in parallel with the start of the next loop trip; the programmer can either drop out of the loop to handle the last pass through the tail, or use predicates to disable the head and body of the loop the last time through. This method anticipates a fundamental parallel computation method for which Itanium is uniquely suited: the software pipeline.

The Parallel Universe: Mining Parallelism Pt. IV

Decisions, Decisions. When given the choice between cake and ice cream, most two-year children will pick both (of course). As adults, or at least people old enough to act like adults, we know that you can’t have your cake and eat it too … or can you? One major source of branching, and hence pipeline stalls, are the dynamic choices in the execution path, better known to programmers as the ‘if’ statement. Unlike looping branches, these are hard to predict, although coding the likely case as the true clause is probably best practice. Still, what if we could avoid the branch altogether?

Enter predication, a key Itanium feature for maximizing performance. Itanium predicates are a set of 63 bits that can be independently set by set or cleared by various instructions, compare being the most common. Better yet, predicates may be set in complementary pairs: one predicate records the comparison result, the other records the logical inverse of the comparison result. The complete Itanium predicate calculus is rich and powerful and a topic for a more technical discussion; suffice to say that a diligent programmer can evaluate complex condition expressions in a very small sequences of predicate operations.

So, how does this help us have our cake and eat it, too? Instead of evaluating a condition and branching, which just uses a computed predicate to enable or disable an otherwise unconditional branch, the programmer can code both branches of the ‘if’ in parallel, using the computed predicate pair to dynamically select one or the other. Other variations are possible: executing several independent ‘if’ clauses in parallel with multiple sets of predicates, or executing multiple branches from a sequence of ‘if … else if … else’ statements, where the predicates computed from earlier conditionals can affect the evaluation of subsequent conditionals. It’s true that we’re not necessarily increasing our IPC with this technique, but we are keeping the pipeline full (of cake and ice cream?), reducing code size, and keeping more instructions in cache – all performance wins.

The Parallel Universe: Mining Parallelism Pt. III

SIMD and MIMD. One basic type of parallelism is the single instruction, multiple data, or SIMD. This mode performs the same operation on multiple data items in parallel. As with other sources of parallelism, Itanium gives the programmer or compiler explicit control over what instructions are executed in the same cycle. Further, many instructions, the ALU type, can be executed on most of the functional units. Employing the large register set and the three-operand instruction (C = A op B), SIMD parallelism is readily realized.

Additionally, the instruction set architecture defines 1-, 2-, and 4-byte parallel operations for integer computation, as well as pair-wise floating-point operations, for even higher levels of SIMD operation.

These same features are utilized for a second type of parallelism: multiple instruction, multiple data, or MIMD. This mode is, in fact, the most fundamental modis operandi for Itanium. SIMD, using the same instructions or the parallel instructions described above, and MIMD can be combined, mixed, matched, and tailored to produce the highest level of performance for almost any data-intensive computation.

In the next few posts, I’ll be digging into the many rich veins of parallelism that arise from the nature of computations and the structure of programs.