The Expression Express

What kinds of valarray expressions will take the express lane of Altivec optimization? It all comes down to the chunk.

The Altivec optimization is invoked whenever a const-chunkable expression constructs or is assigned to a chunkable expression, or a const-chunkable expression is summarized using sum, min or max.

Currently, only Altivec base types are optimizable: char, unsigned char, short, unsigned short, long, unsigned long and float. I expect to add some more types to the list eventually: long long, unsigned long long and certain std::complex types.
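As a rough illustration of that list, a compile-time trait can record which element types qualify. The name is_altivec_base is hypothetical and is not claimed to be part of macstl; this is only a sketch of how such a list might be encoded.

```cpp
#include <type_traits>

// Hypothetical trait (an assumption for illustration, not a macstl name):
// true only for the element types listed above as currently optimizable.
template <typename T> struct is_altivec_base : std::false_type {};

template <> struct is_altivec_base<char>           : std::true_type {};
template <> struct is_altivec_base<unsigned char>  : std::true_type {};
template <> struct is_altivec_base<short>          : std::true_type {};
template <> struct is_altivec_base<unsigned short> : std::true_type {};
template <> struct is_altivec_base<long>           : std::true_type {};
template <> struct is_altivec_base<unsigned long>  : std::true_type {};
template <> struct is_altivec_base<float>          : std::true_type {};

// Checked at compile time: float is on the list, double is not.
static_assert(is_altivec_base<float>::value, "float is optimizable");
static_assert(!is_altivec_base<double>::value, "double is not optimizable");
```

Adding a type later, such as long long, would then be a matter of one more specialization.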

To chunk or not to chunk

A chunkable expression is an l-value that can be written to in chunks. Only valarrays of Altivec types and std::valarray <bool> are chunkable.

A const-chunkable expression is an r-value that can be read from in chunks. An expression is const-chunkable if it is either a valarray of an Altivec type, a scalar of an Altivec type or an expression containing only such valarrays and scalars, arithmetic operators and transcendental functions — except for certain boolean expressions.
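To make "read from in chunks" and "written to in chunks" concrete, here is a portable sketch that evaluates dest = a * b + c four floats at a time, with a scalar loop for the leftover elements. It uses a plain struct in place of a real Altivec vector register, and every name in it (float_chunk, load_chunk, store_chunk, multiply_add_chunked) is an assumption made for illustration rather than anything taken from macstl.

```cpp
#include <cstddef>

// Assumption: one chunk holds 4 floats, mirroring a 16-byte vector register.
const std::size_t chunk_size = 4;

struct float_chunk { float lane[4]; };

// Read one chunk from a const-chunkable operand.
inline float_chunk load_chunk(const float* p) {
    float_chunk c;
    for (std::size_t i = 0; i < chunk_size; ++i) c.lane[i] = p[i];
    return c;
}

// Write one chunk to a chunkable destination.
inline void store_chunk(float* p, const float_chunk& c) {
    for (std::size_t i = 0; i < chunk_size; ++i) p[i] = c.lane[i];
}

// Lane-wise a * b + c on whole chunks.
inline float_chunk multiply_add(const float_chunk& a, const float_chunk& b,
                                const float_chunk& c) {
    float_chunk r;
    for (std::size_t i = 0; i < chunk_size; ++i)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

// dest = a * b + c for n elements: whole chunks first, then a scalar tail.
void multiply_add_chunked(const float* a, const float* b, const float* c,
                          float* dest, std::size_t n) {
    std::size_t i = 0;
    for (; i + chunk_size <= n; i += chunk_size)
        store_chunk(dest + i, multiply_add(load_chunk(a + i),
                                           load_chunk(b + i),
                                           load_chunk(c + i)));
    for (; i < n; ++i)              // elements that do not fill a chunk
        dest[i] = a[i] * b[i] + c[i];
}
```

On real hardware each of the three helpers would map to a single vector instruction, which is where the gain over the scalar tail loop would come from.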

The following sorts of expressions are unchunkable (neither chunkable nor const-chunkable) and thus won’t be optimized:

valarrays of non-Altivec types
certain boolean expressions (see below)
apply called on expressions
subsets of expressions, i.e. slice, gslice, mask and indirect subsets

For example:

    std::valarray <float> vf1, vf2, vf3;
    std::valarray <double> vd1, vd2, vd3;
    std::valarray <std::size_t> vl1;    // index array for the indirect subset

    vf1                      // const-chunkable, Altivec base type
    vf1 * vf2 + vf3          // const-chunkable, only arithmetic
    cos (vf1) + sin (vf2)    // const-chunkable, arithmetic and transcendental
    vd1                      // unchunkable, not Altivec base type
    vd1 * vd2 + vd3          // unchunkable, not Altivec base type
    vf1 [vl1]                // unchunkable, indirect subset
    vf1 [vf2 == vf3]         // unchunkable, mask subset
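The classification rules above are recursive, so they can be sketched as a compile-time trait over a toy expression tree. The node types and trait names below are hypothetical stand-ins for whatever expression templates a library really uses. For brevity the sketch treats only float and short as vector element types and ignores the special rules for boolean expressions.

```cpp
#include <type_traits>
#include <valarray>

// Toy expression nodes (assumptions for illustration only).
template <typename L, typename R> struct arithmetic_node {};  // e.g. a * b
template <typename E> struct transcendental_node {};          // e.g. cos (a)
template <typename E> struct subset_node {};                  // slice, mask...
template <typename E> struct apply_node {};                   // apply (f)

// Simplified element list: only two of the optimizable types.
template <typename T> struct is_vector_element : std::false_type {};
template <> struct is_vector_element<float> : std::true_type {};
template <> struct is_vector_element<short> : std::true_type {};

// Default case: a scalar is const-chunkable if its type is a vector element.
// Subset and apply nodes also land here and therefore come out false.
template <typename E>
struct is_const_chunkable : is_vector_element<E> {};

// A valarray is const-chunkable if its element type is.
template <typename T>
struct is_const_chunkable<std::valarray<T> > : is_vector_element<T> {};

// Arithmetic needs both operands to be const-chunkable.
template <typename L, typename R>
struct is_const_chunkable<arithmetic_node<L, R> >
    : std::integral_constant<bool, is_const_chunkable<L>::value &&
                                   is_const_chunkable<R>::value> {};

// A transcendental function is as chunkable as its argument.
template <typename E>
struct is_const_chunkable<transcendental_node<E> > : is_const_chunkable<E> {};
```

With these definitions, the tree for cos (vf1) + vf2 evaluates to true, while anything containing a std::valarray <double> or a subset node evaluates to false.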

Whose Truth is It?

C++ has the type bool, whose size is implementation-defined; on the PowerPC ABI assumed here it is 4 bytes long, the same as the processor word. Though a bool is logically either true or false, its storage is a full word that is not guaranteed to hold an all-zeros or all-ones bit pattern: C++ treats a zero value as false and any nonzero value as true, and true itself converts to the integer 1 rather than to a word with every bit set.

Altivec introduces the concept of sized booleans: booleans that are 1, 2 or 4 bytes long. They are the results of the various boolean-valued Altivec functions, sized to match the elements being compared, and each one must have either all bits 0 or all bits 1.

I’ve encapsulated these sized booleans in the macstl::boolean template. For example, if vs1 and vs2 are vector signed shorts, then vec_eq (vs1, vs2) is a vector bool short, whose elements are 2-byte sized booleans, i.e. macstl::boolean <short> objects.
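The idea of a sized boolean can be sketched in portable C++ as a small wrapper that stores either all bits 0 or all bits 1 in an unsigned integer of the chosen width. The class below is a hypothetical illustration and is not the actual macstl::boolean implementation; in particular, parameterizing it on an unsigned storage type is an assumption of this sketch.

```cpp
#include <cstdint>

// Hypothetical sized boolean (not macstl::boolean itself): the stored
// pattern is always all bits 0 (false) or all bits 1 (true).
template <typename UInt>
class sized_boolean {
public:
    sized_boolean(bool b = false)
        : bits_(b ? static_cast<UInt>(~UInt(0)) : UInt(0)) {}

    // Any nonzero pattern reads back as true.
    operator bool() const { return bits_ != 0; }

    // Raw bit pattern, as a vector compare would produce it.
    UInt bits() const { return bits_; }

private:
    UInt bits_;
};

typedef sized_boolean<std::uint8_t>  boolean8;   // 1-byte boolean
typedef sized_boolean<std::uint16_t> boolean16;  // 2-byte boolean
typedef sized_boolean<std::uint32_t> boolean32;  // 4-byte boolean
```

Because the constructor normalizes its argument, a value of this type never holds the "true stored as 1" pattern that a plain bool does.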

Truth and Consequence

These differences complicate boolean expressions somewhat.

First, expressions that combine differently sized boolean chunks are not const-chunkable at all, since the corresponding Altivec types would have different numbers of elements.
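The element-count mismatch follows directly from the 16-byte width of an Altivec register. A minimal sketch, using hypothetical helper names and fixed-width integer types in place of short and long so that it behaves the same on any host:

```cpp
#include <cstddef>
#include <cstdint>

// An Altivec register is 128 bits, i.e. 16 bytes, wide.
constexpr std::size_t vector_bytes = 16;

// Number of elements of type T that fit into one chunk.
template <typename T>
constexpr std::size_t elements_per_chunk() {
    return vector_bytes / sizeof(T);
}

// Two boolean chunks can only be combined lane by lane if their
// element sizes, and hence their element counts, agree.
template <typename A, typename B>
constexpr bool chunks_line_up() {
    return sizeof(A) == sizeof(B);
}

static_assert(elements_per_chunk<float>() == 4, "4 floats per chunk");
static_assert(elements_per_chunk<std::int16_t>() == 8, "8 shorts per chunk");
```

A comparison of floats therefore yields 4 booleans per chunk while a comparison of shorts yields 8, so there is no lane-by-lane way to combine the two.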

Second, although std::valarray <bool> is chunkable, it is not const-chunkable: you can write chunks into a std::valarray <bool> from some boolean expression, but you can’t read chunks from an expression involving std::valarray <bool>. I put this restriction in because each std::valarray <bool> element can store an arbitrary integer thanks to its C++ legacy, but Altivec would expect it to have all bits 1 or all bits 0.

Third, even if a boolean expression is const-chunkable, it must have long-sized chunks to construct or assign to a std::valarray <bool>. This follows from the fact that bool is actually long-sized on the PowerPC. Const-chunkable boolean expressions with other chunk sizes may still be summarized, or may construct or be assigned to a std::valarray <macstl::boolean <T> >, where T is either char, short or long.

For example:

    std::valarray <bool> vb1, vb2;
    std::valarray <float> vf1, vf2;
    std::valarray <short> vs1, vs2;
    bool b;

    vb1 = vf1 == vf2;                            // optimized, since expression has long-sized chunks
    vb1 = vs1 == vs2;                            // not optimized, since expression has short-sized chunks
    vb1 = (vf1 == vf2) && (vs1 == vs2);          // not optimized, combining different sized chunks
    vb2 = vb1;                                   // not optimized, vb1 not const-chunkable
    vb2 = vb1 && (vf1 == vf2);                   // not optimized, vb1 not const-chunkable
    b = (vf1 == vf2).sum ();                     // optimized, vf1 == vf2 is const-chunkable
    b = (vs1 == vs2).sum ();                     // optimized, vs1 == vs2 is const-chunkable
    b = ((vf1 == vf2) && (vs1 == vs2)).sum ();   // not optimized, combining different sized chunks
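It may help to spell out what summarizing a boolean expression with sum () computes. In standard C++, sum () accumulates elements with +=, and for bool that leaves the result true as soon as any element is true. The sketch below shows this with the plain standard library; whether macstl's optimized path follows the same semantics is an assumption here, and the helper name any_equal is hypothetical.

```cpp
#include <valarray>

// Assumes a and b are non-empty and of equal size: calling sum () on an
// empty valarray is undefined behaviour in standard C++.
bool any_equal(const std::valarray<float>& a, const std::valarray<float>& b) {
    // Element-wise comparison, materialized as a valarray of bool.
    const std::valarray<bool> equal = (a == b);
    // bool += bool: the running result becomes true once any element is true.
    return equal.sum();
}
```

In other words, under the standard semantics a line of the form b = (x == y).sum () asks whether any pair of elements matched, not how many did.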

What’s in Store

My benchmarks showed that trying to optimize unchunkable expressions did not yield much of a performance gain over scalar code. But even though they are not optimized for Altivec, the scalar code they produce will still be clean and tight.

In the future, I may investigate making more of these chunkable.
