Sunday, March 31, 2013

Aligning Memory

What is alignment?

Let's say we have two sticks, one with tick marks every 4mm, and one with tick marks every 8mm.  When we put these two sticks next to each-other we find all the places they match up.
          1         2         3         4         5         6         7         8
012345678901234567890123456789012345678901234567890123456789012345678901234567890
:   |   :   |   :   |   :   |   :   |   :   |   :   |   :   |   :   |   :   |   :
:       :       :       :       :       :       :       :       :       :       :
Our two stick match at every 8mm.  That's pretty simple and strait forward, so let's try something evil.  Again with two sticks, one with marks every 3mm, and one with marks every 7mm.
          1         2         3         4         5         6         7         8
012345678901234567890123456789012345678901234567890123456789012345678901234567890
:  |  |  |  |  |  |  :  |  |  |  |  |  |  :  |  |  |  |  |  |  :  |  |  |  |  |
:      |      |      :      |      |      :      |      |      :      |      |
Our sticks match at every 21mm, or align at 21mm, or have an alignment of 21mm.  Now we know what alignment is: the interval at which two or more intervals meet.  Which also happens to be mathematically identical to the Least Common Multiple.


How do you find a common alignment?

In our first example above, it was easy.  The second interval was a factor of the first interval, therefore the alignment was the same as the second interval.  In the second example we need to find the least common multiple of 3 and 7.  Here's how:
constexpr std::size_t
greatest_common_factor( std::size_t a ) noexcept
{
  return a;
}

constexpr std::size_t
greatest_common_factor( std::size_t a, std::size_t b ) noexcept
{
  return b ? greatest_common_factor( b, a % b ) : a;
}

template< typename ...Args >
constexpr std::size_t
greatest_common_factor( std::size_t a, std::size_t b,
                        std::size_t c, Args ... args) noexcept
{
  return greatest_common_factor( greatest_common_factor( a, b ), c, args... );
}

constexpr std::size_t
least_common_multiple( std::size_t a ) noexcept
{
  return a;
}

constexpr std::size_t
least_common_multiple( std::size_t a, std::size_t b ) noexcept
{
  return a / greatest_common_factor( a, b ) * b;
}

template< typename ...Args >
constexpr std::size_t
least_common_multiple( std::size_t a, std::size_t b,
                       std::size_t c, Args ... args) noexcept
{
  return least_common_multiple( least_common_multiple( a, b ), c, args... );
}
Where least_common_multiple( 3, 7 ) == 21.  So now that we know how to do arbitrary alignment, let's forget all about it.


What is memory alignment?

Memory alignment is in reference to a type of data, an integer, a float, etc., that has an interval at which it should be placed in memory.  An integer with an alignment of 2, placed at position 5, is unaligned, its position doesn't match its alignment interval.  If it was placed at position 6, or 8, or any multiple of its alignment, it would be aligned.


Typically data types, in a system, are all powers of 2.  This makes it so all types with weaker (smaller interval) alignment can be positioned near types with stricter (larger interval) alignment easily.
struct stuff
{
  std::int32_t a;
  std::int16_t b;
  std::int16_t c;
};
If every type in stuff has an alignment equal to its size, then how could stuff be laid out in memory?  What's stuff alignment?
          1         2         3         4         5         6         7         8
012345678901234567890123456789012345678901234567890123456789012345678901234567890
aaaabbcc
aaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbcc
    aaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbcc

The alignment of stuff is 4. If std::int32_t had an alignment of 7, and std::int16_t had an alignment of 3, what would the alignment of stuff be?
          1         2         3         4         5         6         7         8
012345678901234567890123456789012345678901234567890123456789012345678901234567890
aaaaaaa  bbbccc      aaaaaaa  bbbccc      aaaaaaa  bbbccc      aaaaaaa  bbbccc
       aaaaaaa bbbccc       aaaaaaa bbbccc       aaaaaaa bbbccc
              aaaaaaabbbccc        aaaaaaabbbccc        aaaaaaabbbccc       

Depending on the padding between a and b, the origin, for where the alignment starts, changes.  The aligment of stuff would be 21, for each version with different padding, though.  Aren't powers of 2 much nicer?


How does memory alignment affect you?

As it turns out, keeping you data aligned is important.  SSE/AVX data needs to be properly aligned in memory to get the most benefit out of them.  Some times unaligned memory access is not allowed; ARMv4, ARMv5, and ARM device memory.  I'm not picking on ARM, they're just a more recent example.  In general you want all your data to be properly aligned all the time, and to copy any unaligned data into an aligned space before using it.


Thankfully the compiler is very good at keeping everything aligned by default.  Aggregate types all have corrected offsets for each member to maintain it's alignment, and padding at the end of the aggregate to keep all its members aligned when there is an array of the aggregate.  Problems creep in when we use non-basic types that have extended alignment requirements (SSE/AVX), or access memory directly.  For instance:
ifstream file( "example.bin", ios::in | ios::binary );
auto bytes = new char[ 6 ];
file.read( bytes, 6 );
auto small = *reinterpret_cast< std::int16_t* >( bytes );
auto large = *reinterpret_cast< std::int32_t* >( bytes + 2 );
This is rather simple.  We have a file that stores two values, one 16bit and one 32bit. Setting small should always work, because new will ways return memory aligned for at least all the basic types, or more depending on implementation.  The problem comes with trying to set large.  This leads to a few cases:
  1. It's aligned, if your system's base alignment is 16bits. (most likely not)
  2. It's unaligned, and your system's allows unaligned 32bit reads (impacting performance)
  3. It's unaligned, and your compiler generates appropriate code/instructions (impacting performance)
  4. It's unaligned, and your program crashes. (severely impacting performance)
The problem is worse, as it's probably type dependent.  On one system, all basic integer types don't require strict alignment, but float types do.  On another, only certain operations require strict alignment.  So, the moral is, always align your data.  One solution to the above is:
template< typename T, std::size_t S = sizeof( T ) >
void safe_read( T* dst, void* src )
{
  static_assert( !std::is_const< T >::value, "T cannot be constant" );
  static_assert( std::is_trivial< T >::value, "T must be trivial" );
  typedef std::array< char, S > raw_type;
  *reinterpret_cast< raw_type* >( dst ) = *reinterpret_cast< raw_type* >( src );
}

template< typename T, std::size_t S = sizeof( T ) >
T safe_read( void* src )
{
  T nrvo;
  safe_read< T, S >( &nrvo, src );
  return nrvo;
}

template< typename T, std::size_t S = sizeof( T ) >
T safe_read( T* src )
{
  T nrvo;
  safe_read< T, S >( &nrvo, src );
  return nrvo;
}

template< typename T, std::size_t S = sizeof( T ) >
void safe_write( T const & src, void* dst )
{
  static_assert( std::is_trivial< T >::value, "T must be trivial" );
  typedef std::array< char, S > raw_type;
  auto& s = *reinterpret_cast< raw_type const* >( &src );
  auto& d = *reinterpret_cast< raw_type* >( dst );
  d = s;
}
Then just change the assignment of large to:
auto large = safe_read< std::int32_t >( bytes + 2 );
I allowed overriding the size, for times when I know the real size of a padded type.  You should always read only the bytes you need, because reading into protected memory will cause a crash.  This is only good for trivial types.  If you have a anything other than a trivial type unaligned in memory, your run-time/standard library is broken, or you used placement new without taking into account the alignment for the type.  In case 3, from above, the compiler is most likely generating code identical to safe_read, unless your system has special instructions for unaligned reads.  If your system does have unaligned reads, the optimizer will replace safe_read with those instructions, win, win!


How do you get aligned memory?

Remember what I said before about new, here it is again from the standard (§3.7.4.1-2): The pointer returned shall be suitably aligned so that it can be converted to a pointer of any complete object type with a fundamental alignment requirement(3.11).  (§3.11): A fundamental alignment is represented by an alignment less than or equal to the greatest alignment supported by the implementation in all contexts, which is equal to alignof( std::max_align_t ).  If your compiler, like mine, doesn't define std::max_align_t this works:
#include <type_traits>
namespace std { typedef aligned_storage< 1 >::type max_align_t; }
This tells us something important about memory allocations, they are always done in chunks of alignof( std::max_align_t ).  Therefore any allocation smaller than the chunk size is wasting memory.  This is what makes memory pools so effective, all types of the same size are grouped close together with no wasted space between them.  Less wasted space, means less cache misses, mean better performance, means happy users.  Let's see what a simple solution looks like:
template< typename T >
T* align_pointer_up( T* ptr, std::size_t alignment ) noexcept
{
  auto bytes = reinterpret_cast< std::uintptr_t >( ptr );
  return reinterpret_cast< T* >(
        ( bytes + alignment - 1 ) / alignment * alignment );
}

void* get_aligned_memory( std::size_t size, std::size_t alignment )
{
  size += sizeof( void* ); // include the origin
  auto start = reinterpret_cast< char* >(
        ::operator new( size + alignment - 1 ) );
  auto user = align_pointer_up( start + sizeof( void* ), alignment );
  safe_write( start, user - sizeof( void* ) );
  return user;
}

void return_aligned_memory( void* ptr ) noexcept
{
  ::operator delete( safe_read( reinterpret_cast< void** >( ptr ) - 1 ) );
}
This is a good start, and you'll see versions of it a lot on the web.  It over allocates memory with enough space to place the return pointer in the requested alignment, plus space for a pointer back to the original pointer.  This works great, but it could be better.


If we know the alignment in both the get and return functions, we can avoid saving the original pointer whenever: alignment ≤ chunk size.  This works, because for each chunk, our alignment will appear at least once.  So we just need to move the pointer back to the beginning of the chuck to get the original pointer.
template< typename T >
T* align_pointer_down( T* ptr, std::size_t alignment ) noexcept
{
  auto bytes = reinterpret_cast< std::uintptr_t >( ptr );
  return reinterpret_cast< T* >( bytes / alignment * alignment );
}

constexpr auto max_align = alignof( std::max_align_t );

template< typename T >
T* get_aligned_buffer( std::size_t count )
{
  // include the origin if needed
  auto extra_bytes = alignof( T ) <= max_align ? 0 : sizeof( void* );
  auto size = sizeof( T ) * count + extra_bytes;
  auto start = reinterpret_cast< char* >(
        ::operator new( size + alignof( T ) - 1 ) );
  auto user = align_pointer_up( start + extra_bytes, alignof( T ) );
  if(extra_bytes) {
    safe_write( start, user - extra_bytes ); }
  return reinterpret_cast< T* >( user );
}

template< typename T >
void return_aligned_buffer( T* ptr ) noexcept
{
  if ( alignof( T ) <= max_align ) {
    ::operator delete( align_pointer_down( ptr, max_align ) );
  } else {
    ::operator delete( safe_read( reinterpret_cast< void** >( ptr ) - 1 ) );
  }
}


When to give up.

If your compiler doesn't support aligning the local stack variables with the alignment you require, give up, or find a new compiler.  This was a problem in the early days with SSE.  Trying to work with your data type only on the heap in aligned memory, is asking for trouble.  It might be possible, but the number of places for errors to creep in is amazing.

No comments:

Post a Comment