What kinds of valarray expressions will take the express lane of Altivec optimization? It all comes down to the chunk.
The Altivec optimization is invoked whenever a const-chunkable expression constructs or is assigned to a chunkable expression, or a const-chunkable expression is summarized using sum, min or max.
Currently, only Altivec base types are optimizable: char, unsigned char, short, unsigned short, long, unsigned long and float. I expect to add some more types to the list eventually: long long, unsigned long long and certain std::complex types.
A chunkable expression is an l-value that can be written to in chunks. Only valarrays of Altivec types and std::valarray <bool> are chunkable.
A const-chunkable expression is an r-value that can be read from in chunks. An expression is const-chunkable if it is either a valarray of an Altivec type, a scalar of an Altivec type or an expression containing only such valarrays and scalars, arithmetic operators and transcendental functions. The exception is certain boolean expressions, described further on.
The following sorts of expressions are unchunkable (neither chunkable nor const-chunkable) and thus won't be optimized:
For example:
std::valarray <float> vf1, vf2, vf3;
std::valarray <double> vd1, vd2, vd3;
std::valarray <std::size_t> vl1; // index array for the indirect subset below
vf1 // const-chunkable, Altivec base type
vf1 * vf2 + vf3 // const-chunkable, only arithmetic
cos (vf1) + sin (vf2) // const-chunkable, arithmetic and transcendental
vd1 // unchunkable, not Altivec base type
vd1 * vd2 + vd3 // unchunkable, not Altivec base type
vf1 [vl1] // unchunkable, indirect subset
vf1 [vf2 == vf3] // unchunkable, mask subset
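To make "read from in chunks" concrete, here is a minimal sketch in portable C++ that evaluates vf1 * vf2 + vf3 four floats at a time, with a scalar loop for the leftover elements. The names chunk_size and chunked_multiply_add are hypothetical, and the inner loop only stands in for what a single Altivec instruction such as vec_madd would do. It illustrates the idea and is not how macstl is implemented.

```cpp
#include <cstddef>
#include <valarray>

// Hypothetical constant: number of floats in one 16-byte Altivec vector.
const std::size_t chunk_size = 4;

// Computes a * b + c, walking the arrays one chunk at a time.
// Assumes b and c have at least as many elements as a.
std::valarray<float> chunked_multiply_add(const std::valarray<float>& a,
                                          const std::valarray<float>& b,
                                          const std::valarray<float>& c)
{
    const std::size_t n = a.size();
    std::valarray<float> result(n);
    std::size_t i = 0;

    // Whole chunks: with Altivec, each pass of the outer loop would be
    // a single vector operation instead of this inner scalar loop.
    for (; i + chunk_size <= n; i += chunk_size)
        for (std::size_t j = 0; j < chunk_size; ++j)
            result[i + j] = a[i + j] * b[i + j] + c[i + j];

    // Leftover elements that do not fill a chunk are handled one by one.
    for (; i < n; ++i)
        result[i] = a[i] * b[i] + c[i];

    return result;
}
```

An indirect or mask subset does not fit this pattern, because its elements are not contiguous in memory and so cannot be loaded as one chunk.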
C++ has the type bool, whose size is left to the implementation; on the PowerPC it is the same size as the processor word, 4 bytes long. Though a bool is logically either true or false, its storage can actually hold any word-sized integer. C++ simply treats zero values as false, and nonzero values as true.
Altivec introduces the concept of sized booleans: booleans that are either 1, 2 or 4 bytes long. These are the results of various boolean-valued Altivec functions, sized to match the element size, and they must have either all bits 0 or all bits 1.
I've encapsulated these sized booleans in the macstl::boolean template. For example, where vs1 and vs2 are vector signed shorts, then vec_cmpeq (vs1, vs2) is a vector bool short, whose elements are 2-byte sized booleans, or macstl::boolean <short> objects.
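The idea of a sized boolean can be sketched in portable C++ as follows. The class sized_boolean is a hypothetical stand-in written for illustration; it is not the macstl::boolean implementation, whose interface may well differ. It only shows the two properties described above: the object is exactly as large as its integer type T, and its bits are either all 0 or all 1.

```cpp
// Hypothetical stand-in for a sized boolean; not macstl::boolean itself.
// T is an integer type that fixes the size of the boolean.
template <typename T>
class sized_boolean
{
public:
    // true is stored as all bits 1, false as all bits 0.
    sized_boolean(bool value = false)
        : bits_(value ? static_cast<T>(~T(0)) : T(0)) {}

    // Reads back as an ordinary bool: any nonzero pattern is true.
    operator bool() const { return bits_ != 0; }

    // Exposes the raw bit pattern for inspection.
    T bits() const { return bits_; }

private:
    T bits_;
};
```

A sized_boolean <short> therefore occupies 2 bytes, so eight of them would fit in one 16-byte vector.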
These differences complicate boolean expressions somewhat.
First, expressions that combine differently sized boolean chunks are not const-chunkable at all, since the corresponding Altivec types would have different numbers of elements.
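The mismatch is simple arithmetic, sketched below. An Altivec vector is 16 bytes wide, so the number of elements in a chunk depends on the element size. The names vector_bytes and elements_per_chunk are hypothetical helpers for this illustration only.

```cpp
#include <cstddef>

// An Altivec vector register is 16 bytes (128 bits) wide.
const std::size_t vector_bytes = 16;

// Number of elements of type T that fit in one vector.
template <typename T>
std::size_t elements_per_chunk()
{
    return vector_bytes / sizeof(T);
}
```

Comparing floats yields chunks of 4 long-sized booleans, while comparing shorts yields chunks of 8 short-sized booleans, so the two results cannot be combined chunk against chunk.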
Second, while std::valarray <bool> is chunkable, it is not const-chunkable: while you can write chunks into a std::valarray <bool> from some boolean expression, you can't read chunks from an expression involving std::valarray <bool>. I put this restriction in because each std::valarray <bool> element can store an arbitrary integer thanks to its C++ legacy, but Altivec would expect it to have all bits 1 or 0.
Third, even if a boolean expression is const-chunkable, it must have long-sized chunks to construct or assign to a std::valarray <bool>. This follows from the fact that bool is actually long-sized on the PowerPC. The other kinds of const-chunkable expressions may still be summarized, or construct or be assigned to std::valarray <macstl::boolean <T> >, where T is either char, short or long.
For example:
std::valarray <bool> vb1, vb2;
std::valarray <float> vf1, vf2;
std::valarray <short> vs1, vs2;
bool b;
vb1 = vf1 == vf2; // optimized, since expression has long-sized chunks
vb1 = vs1 == vs2; // not optimized, since expression has short-sized chunks
vb1 = (vf1 == vf2) && (vs1 == vs2);
// not optimized, combining different sized chunks
vb2 = vb1; // not optimized, vb1 not const-chunkable
vb2 = vb1 && (vf1 == vf2); // not optimized, vb1 not const-chunkable
b = (vf1 == vf2).sum (); // optimized, vf1 == vf2 is const-chunkable
b = (vs1 == vs2).sum (); // optimized, vs1 == vs2 is const-chunkable
b = ((vf1 == vf2) && (vs1 == vs2)).sum (); // not optimized, combining different sized chunks
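Whether or not an expression is optimized, its result is the same. As a point of reference, here is a minimal sketch using only the standard std::valarray that evaluates the mixed comparison from the example above and counts the elements where both comparisons hold. The function name count_both_equal is hypothetical, and all four arrays are assumed to have the same length.

```cpp
#include <cstddef>
#include <valarray>

// Counts the positions where f1 == f2 and s1 == s2 both hold.
// Assumes all four arrays have the same number of elements.
std::size_t count_both_equal(const std::valarray<float>& f1,
                             const std::valarray<float>& f2,
                             const std::valarray<short>& s1,
                             const std::valarray<short>& s2)
{
    // Each comparison yields one boolean per element.
    std::valarray<bool> float_mask = (f1 == f2);
    std::valarray<bool> short_mask = (s1 == s2);

    // Element-wise logical AND of the two masks.
    std::valarray<bool> both = float_mask && short_mask;

    // Count the true elements with a plain scalar loop.
    std::size_t count = 0;
    for (std::size_t i = 0; i < both.size(); ++i)
        if (both[i])
            ++count;
    return count;
}
```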
Benchmarks have shown that trying to optimize unchunkable expressions didn't yield much of a performance gain over scalar code. But even though they are not optimized for Altivec, the scalar code they produce will still be clean and tight.
In the future, I may investigate making more of these chunkable.