Sunday, March 31, 2013

Aligning Memory

What is alignment?

Let's say we have two sticks, one with tick marks every 4mm, and one with tick marks every 8mm.  When we put these two sticks next to each-other we find all the places they match up.
          1         2         3         4         5         6         7         8
012345678901234567890123456789012345678901234567890123456789012345678901234567890
:   |   :   |   :   |   :   |   :   |   :   |   :   |   :   |   :   |   :   |   :
:       :       :       :       :       :       :       :       :       :       :
Our two stick match at every 8mm.  That's pretty simple and strait forward, so let's try something evil.  Again with two sticks, one with marks every 3mm, and one with marks every 7mm.
          1         2         3         4         5         6         7         8
012345678901234567890123456789012345678901234567890123456789012345678901234567890
:  |  |  |  |  |  |  :  |  |  |  |  |  |  :  |  |  |  |  |  |  :  |  |  |  |  |
:      |      |      :      |      |      :      |      |      :      |      |
Our sticks match at every 21mm, or align at 21mm, or have an alignment of 21mm.  Now we know what alignment is: the interval at which two or more intervals meet.  Which also happens to be mathematically identical to the Least Common Multiple.


How do you find a common alignment?

In our first example above, it was easy.  The second interval was a factor of the first interval, therefore the alignment was the same as the second interval.  In the second example we need to find the least common multiple of 3 and 7.  Here's how:
constexpr std::size_t
greatest_common_factor( std::size_t a ) noexcept
{
  return a;
}

constexpr std::size_t
greatest_common_factor( std::size_t a, std::size_t b ) noexcept
{
  return b ? greatest_common_factor( b, a % b ) : a;
}

template< typename ...Args >
constexpr std::size_t
greatest_common_factor( std::size_t a, std::size_t b,
                        std::size_t c, Args ... args) noexcept
{
  return greatest_common_factor( greatest_common_factor( a, b ), c, args... );
}

constexpr std::size_t
least_common_multiple( std::size_t a ) noexcept
{
  return a;
}

constexpr std::size_t
least_common_multiple( std::size_t a, std::size_t b ) noexcept
{
  return a / greatest_common_factor( a, b ) * b;
}

template< typename ...Args >
constexpr std::size_t
least_common_multiple( std::size_t a, std::size_t b,
                       std::size_t c, Args ... args) noexcept
{
  return least_common_multiple( least_common_multiple( a, b ), c, args... );
}
Where least_common_multiple( 3, 7 ) == 21.  So now that we know how to do arbitrary alignment, let's forget all about it.


What is memory alignment?

Memory alignment is in reference to a type of data, an integer, a float, etc., that has an interval at which it should be placed in memory.  An integer with an alignment of 2, placed at position 5, is unaligned, its position doesn't match its alignment interval.  If it was placed at position 6, or 8, or any multiple of its alignment, it would be aligned.


Typically data types, in a system, are all powers of 2.  This makes it so all types with weaker (smaller interval) alignment can be positioned near types with stricter (larger interval) alignment easily.
struct stuff
{
  std::int32_t a;
  std::int16_t b;
  std::int16_t c;
};
If every type in stuff has an alignment equal to its size, then how could stuff be laid out in memory?  What's stuff alignment?
          1         2         3         4         5         6         7         8
012345678901234567890123456789012345678901234567890123456789012345678901234567890
aaaabbcc
aaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbcc
    aaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbccaaaabbcc

The alignment of stuff is 4. If std::int32_t had an alignment of 7, and std::int16_t had an alignment of 3, what would the alignment of stuff be?
          1         2         3         4         5         6         7         8
012345678901234567890123456789012345678901234567890123456789012345678901234567890
aaaaaaa  bbbccc      aaaaaaa  bbbccc      aaaaaaa  bbbccc      aaaaaaa  bbbccc
       aaaaaaa bbbccc       aaaaaaa bbbccc       aaaaaaa bbbccc
              aaaaaaabbbccc        aaaaaaabbbccc        aaaaaaabbbccc       

Depending on the padding between a and b, the origin, for where the alignment starts, changes.  The aligment of stuff would be 21, for each version with different padding, though.  Aren't powers of 2 much nicer?


How does memory alignment affect you?

As it turns out, keeping you data aligned is important.  SSE/AVX data needs to be properly aligned in memory to get the most benefit out of them.  Some times unaligned memory access is not allowed; ARMv4, ARMv5, and ARM device memory.  I'm not picking on ARM, they're just a more recent example.  In general you want all your data to be properly aligned all the time, and to copy any unaligned data into an aligned space before using it.


Thankfully the compiler is very good at keeping everything aligned by default.  Aggregate types all have corrected offsets for each member to maintain it's alignment, and padding at the end of the aggregate to keep all its members aligned when there is an array of the aggregate.  Problems creep in when we use non-basic types that have extended alignment requirements (SSE/AVX), or access memory directly.  For instance:
ifstream file( "example.bin", ios::in | ios::binary );
auto bytes = new char[ 6 ];
file.read( bytes, 6 );
auto small = *reinterpret_cast< std::int16_t* >( bytes );
auto large = *reinterpret_cast< std::int32_t* >( bytes + 2 );
This is rather simple.  We have a file that stores two values, one 16bit and one 32bit. Setting small should always work, because new will ways return memory aligned for at least all the basic types, or more depending on implementation.  The problem comes with trying to set large.  This leads to a few cases:
  1. It's aligned, if your system's base alignment is 16bits. (most likely not)
  2. It's unaligned, and your system's allows unaligned 32bit reads (impacting performance)
  3. It's unaligned, and your compiler generates appropriate code/instructions (impacting performance)
  4. It's unaligned, and your program crashes. (severely impacting performance)
The problem is worse, as it's probably type dependent.  On one system, all basic integer types don't require strict alignment, but float types do.  On another, only certain operations require strict alignment.  So, the moral is, always align your data.  One solution to the above is:
template< typename T, std::size_t S = sizeof( T ) >
void safe_read( T* dst, void* src )
{
  static_assert( !std::is_const< T >::value, "T cannot be constant" );
  static_assert( std::is_trivial< T >::value, "T must be trivial" );
  typedef std::array< char, S > raw_type;
  *reinterpret_cast< raw_type* >( dst ) = *reinterpret_cast< raw_type* >( src );
}

template< typename T, std::size_t S = sizeof( T ) >
T safe_read( void* src )
{
  T nrvo;
  safe_read< T, S >( &nrvo, src );
  return nrvo;
}

template< typename T, std::size_t S = sizeof( T ) >
T safe_read( T* src )
{
  T nrvo;
  safe_read< T, S >( &nrvo, src );
  return nrvo;
}

template< typename T, std::size_t S = sizeof( T ) >
void safe_write( T const & src, void* dst )
{
  static_assert( std::is_trivial< T >::value, "T must be trivial" );
  typedef std::array< char, S > raw_type;
  auto& s = *reinterpret_cast< raw_type const* >( &src );
  auto& d = *reinterpret_cast< raw_type* >( dst );
  d = s;
}
Then just change the assignment of large to:
auto large = safe_read< std::int32_t >( bytes + 2 );
I allowed overriding the size, for times when I know the real size of a padded type.  You should always read only the bytes you need, because reading into protected memory will cause a crash.  This is only good for trivial types.  If you have a anything other than a trivial type unaligned in memory, your run-time/standard library is broken, or you used placement new without taking into account the alignment for the type.  In case 3, from above, the compiler is most likely generating code identical to safe_read, unless your system has special instructions for unaligned reads.  If your system does have unaligned reads, the optimizer will replace safe_read with those instructions, win, win!


How do you get aligned memory?

Remember what I said before about new, here it is again from the standard (§3.7.4.1-2): The pointer returned shall be suitably aligned so that it can be converted to a pointer of any complete object type with a fundamental alignment requirement(3.11).  (§3.11): A fundamental alignment is represented by an alignment less than or equal to the greatest alignment supported by the implementation in all contexts, which is equal to alignof( std::max_align_t ).  If your compiler, like mine, doesn't define std::max_align_t this works:
#include <type_traits>
namespace std { typedef aligned_storage< 1 >::type max_align_t; }
This tells us something important about memory allocations, they are always done in chunks of alignof( std::max_align_t ).  Therefore any allocation smaller than the chunk size is wasting memory.  This is what makes memory pools so effective, all types of the same size are grouped close together with no wasted space between them.  Less wasted space, means less cache misses, mean better performance, means happy users.  Let's see what a simple solution looks like:
template< typename T >
T* align_pointer_up( T* ptr, std::size_t alignment ) noexcept
{
  auto bytes = reinterpret_cast< std::uintptr_t >( ptr );
  return reinterpret_cast< T* >(
        ( bytes + alignment - 1 ) / alignment * alignment );
}

void* get_aligned_memory( std::size_t size, std::size_t alignment )
{
  size += sizeof( void* ); // include the origin
  auto start = reinterpret_cast< char* >(
        ::operator new( size + alignment - 1 ) );
  auto user = align_pointer_up( start + sizeof( void* ), alignment );
  safe_write( start, user - sizeof( void* ) );
  return user;
}

void return_aligned_memory( void* ptr ) noexcept
{
  ::operator delete( safe_read( reinterpret_cast< void** >( ptr ) - 1 ) );
}
This is a good start, and you'll see versions of it a lot on the web.  It over allocates memory with enough space to place the return pointer in the requested alignment, plus space for a pointer back to the original pointer.  This works great, but it could be better.


If we know the alignment in both the get and return functions, we can avoid saving the original pointer whenever: alignment ≤ chunk size.  This works, because for each chunk, our alignment will appear at least once.  So we just need to move the pointer back to the beginning of the chuck to get the original pointer.
template< typename T >
T* align_pointer_down( T* ptr, std::size_t alignment ) noexcept
{
  auto bytes = reinterpret_cast< std::uintptr_t >( ptr );
  return reinterpret_cast< T* >( bytes / alignment * alignment );
}

constexpr auto max_align = alignof( std::max_align_t );

template< typename T >
T* get_aligned_buffer( std::size_t count )
{
  // include the origin if needed
  auto extra_bytes = alignof( T ) <= max_align ? 0 : sizeof( void* );
  auto size = sizeof( T ) * count + extra_bytes;
  auto start = reinterpret_cast< char* >(
        ::operator new( size + alignof( T ) - 1 ) );
  auto user = align_pointer_up( start + extra_bytes, alignof( T ) );
  if(extra_bytes) {
    safe_write( start, user - extra_bytes ); }
  return reinterpret_cast< T* >( user );
}

template< typename T >
void return_aligned_buffer( T* ptr ) noexcept
{
  if ( alignof( T ) <= max_align ) {
    ::operator delete( align_pointer_down( ptr, max_align ) );
  } else {
    ::operator delete( safe_read( reinterpret_cast< void** >( ptr ) - 1 ) );
  }
}


When to give up.

If your compiler doesn't support aligning the local stack variables with the alignment you require, give up, or find a new compiler.  This was a problem in the early days with SSE.  Trying to work with your data type only on the heap in aligned memory, is asking for trouble.  It might be possible, but the number of places for errors to creep in is amazing.

Wednesday, March 27, 2013

make_unique (Part 1)

So I went to Herb Sutter's page after watching some videos about  C++11.  This post mentions make_unique, and I wanted to see how far I could take it.  My code is a little more spaced out, so that when the syntax highlighter in my IDE barfs, I can still read it rather easily.  I also prefer the new suffix return syntax for any type longer than six characters, it keeps all the function names lined up on the left.


Part 1 - The Basics

The original version from the post does perfect forwarding of the function parameters to the constructor. No temporaries are made, unless the constructor has pass-by-value parameters, and it keeps the user from calling new, which can produce leaks in the face of unordered execution with exceptions.
template< typename T, typename ...Args >
std::unique_ptr< T > make_unique( Args&& ...args )
{
  return std::unique_ptr< T >( new T( std::forward< Args >( args )... ) );
}
This works for basic objects, until our friend the array comes into play.  There are two types.
First, fixed size arrays: int[ 3 ]
  1. complete type
  2. cannot be copied or assigned to directly
  3. decays to a pointer of the sub-type
  4. addressable (can pass by reference or address)
The last two bullets are the ones that we care about.  I hadn't really thought about number 4, because in general number 3 is what is typically normal, but here's an example.
// don't do this
int bad( int    a  [ 3 ] ) { return sizeof( a ); }
int bad( int ( &a )[ 3 ] ) { return sizeof( a ); }
// do this
int good( int   *a        ) { return sizeof( a ); }
int good( int ( *a )[ 4 ] ) { return sizeof( a ); }

void test()
{
  int a[ 4 ] = {};
  good( a );
  good( &a );
  // bad( a ); // ambiguous, can decay to a pointer or pass by reference
  auto b = a; // b is an int*
  good( b );
  bad( b ); // calls the first bad() due to decayed parameter type
}
Each function above always returns the size of a pointer, because once a enters the function it immediately decays.  clang even warns us about the first bad calling sizeof( a ).
Next, variable size arrays: int[]
  1. incomplete type (only useful in a template parameter)
  2. decays to a pointer of the sub-type
  3. ???
Example:
template< typename T >
auto f( T val ) -> std::pair< bool, bool >
{
  return {
      std::is_array< T >::value,
      std::is_array< decltype( val ) >::value };
}

void test()
{
  int a[ 4 ] = {};
  auto b = f( a );
  auto c = f< int[] >( a );
}
In the above b = { false, false } and c = { true, false }, because f( a ) calls f< int* >( int* val ) and f< int[] >(a) calls f< int[] >( int* val ). Now with that out of the way let's get on with fixing up make_unique.


Part 2 - The Easy Stuff

Let's fix up the original so that it doesn't bother with arrays:
template< typename T, typename ...Args,
          typename = typename std::enable_if<
            !std::is_array< T >::value >::type >
auto make_unique( Args&& ... args ) -> std::unique_ptr< T >
{
  return std::unique_ptr< T >( new T( std::forward< Args >( args )... ) );
}
Notice we don't have to use std::enable_if
We have a bit of a problem with arrays, because we need to know the size of the array to make.  Run-time sized arrays need the user to pass in the size, like so:
// similar to: string val[3] = { "hello", "world" }
make_unique< string[] >( 3, "hello", "world" );
// similar to: string val[1] = { "hello", "world" }
make_unique< string[] >( 1, "hello", "world" ); // throws std::bad_alloc
I'm not really fond of this because it makes the run-time sized version require an extra parameter, which mismatches with how it acts for other types.  For fixed size there's no problem:
// similar to: string val[3] = { "hello", "world" }
make_unique< string[3] >( "hello", "world" );
// similar to: string val[1] = { "hello", "world" }
make_unique< string[1] >( "hello", "world" ); // throws std::bad_alloc
// similar to: string val[] = { "hello", "world" }
make_unique< string[] >( "hello", "world" ); // allow auto sizing, implicit size
It looks like you would expect, so I propose only allowing fixed size arrays with make_unique, and add a new function just for run-time sized arrays called make_unique_array, similar to the first version of make_unique with run-time sized arrays.
template< typename T, typename S, typename... Args >
auto make_unique_array(S size, Args&&... args ) -> std::unique_ptr< T[] >
{
  static_assert( !std::is_array< T >::value, "T cannot be an array" );
  return std::unique_ptr< T[] >
      ( new T[ size ]{ std::forward< Args >( args )... } );
}
We don't allow arrays of arrays because there is no way to forward the actual initializer list into the call site of new.  Equally, we cannot forward a list of arrays and have new use it as expected.  I'll get to multirank arrays next time.  So now we have make_unique support for flat arrays.
template< typename T, typename... Args,
          typename U = typename std::remove_extent< T >::type,
          typename = typename std::enable_if<
            std::is_array< T >::value >::type >
auto make_unique( Args&&... args ) ->  std::unique_ptr< U[] >
{
  constexpr auto size = std::extent< T >::value;

  return make_unique_array< U >
      ( size ? size : sizeof...( Args ), std::forward< Args >( args )... );
}
It's a shame the language doesn't allow complete transparent argument forwarding, but maybe it will in the future, C++2x?.


Part 3 - The (updated) End (maybe?)

I found a proposal for the next C++ revision, and a discussion thread.  So I made some changes. Here's a new interface, one that I proposed.
make_unique< T >( default_init )      // new T
make_unique< T >( args... )           // new T( args... )
make_unique_from_list< T >( args... ) // new T{ args... }; got a better name?
make_unique< T[ N ] >( default_init ) // new T[ N ]
make_unique< T[ N ] >( args... )      // new T[ N ]{ args... }
make_unique< T[] >( args... )         // new T[ sizeof...( args ) ]{ args... }
make_unique_array< T >( n, default_init )    // new T[ n ]
make_unique_array< T >( n, args... )         // new T[ n ]{ args... }
make_unique_array< T >( auto_size, args... ) // new T[ sizeof...( args ) ]{ args... }
I actually think this is pretty close to what I want, but make_unique, just like make_shared, feels like it really should only be for single objects.  I'm beginning to lean towards creating a pair of new classes just for arrays, unique_array and shared_array, similar to std::array.  I want unique_array to have same overhead as a native array, exactly one pointer, but with the functionality of a container.  For shared_array, I'd like to imbed the array in the control block to reduce the number of indirections, just like make_shared does.  So look forward to that in the future.

The old code, which I put under the most liberal license I could find, is here for the taking.  Use it however you want, but don't sue me when your space station crashes to earth...
// Copyright 2013 Paul A. Tessier
//
// Licensed under the Open Source Initiative - MIT License;
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://opensource.org/licenses/mit-license.php
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
// WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#if __cplusplus != 201103L
#error requires C++11 support
#endif

#include <memory>

struct default_init_t { } default_init;

struct auto_size_t { } auto_size;

template< typename T, typename ...Args,
          typename = typename std::enable_if<
            !std::is_array< T >::value >::type >
auto make_unique( Args&& ... args ) -> std::unique_ptr< T >
{
  return std::unique_ptr< T >( new T( std::forward< Args >( args )... ) );
}

template< typename T, typename ...Args,
          typename = typename std::enable_if<
            !std::is_array< T >::value >::type >
auto make_unique( default_init_t ) -> std::unique_ptr< T >
{
  return std::unique_ptr< T >( new T );
}

template< typename T, typename ...Args,
          typename = typename std::enable_if<
            !std::is_array< T >::value >::type >
auto make_unique_from_list( Args&& ... args ) -> std::unique_ptr< T >
{
  return std::unique_ptr< T >( new T{ std::forward< Args >( args )... } );
}

template< typename T, typename... Args >
auto make_unique_array(std::size_t size, Args&&... args )
-> std::unique_ptr< T[] >
{
  static_assert(!std::is_array< T >::value || !sizeof...( Args ),
                "cannot initialize an array of arrays");
  return std::unique_ptr< T[] >
      ( new T[ size ]{ std::forward< Args >( args )... } );
}

template< typename T, typename... Args >
auto make_unique_array(auto_size_t, Args&&... args )-> std::unique_ptr< T[] >
{
  return make_unique_array< T >
      ( sizeof...( Args ), std::forward< Args >( args )... );
}

template< typename T, typename... Args >
auto make_unique_array(std::size_t size, default_init_t )
-> std::unique_ptr< T[] >
{
  return std::unique_ptr< T[] >( new T[ size ] );
}

template< typename T, typename... Args,
          typename U = typename std::remove_extent< T >::type,
          typename = typename std::enable_if< std::is_array< T >::value >::type
          >
auto make_unique( Args&&... args ) ->  std::unique_ptr< U[] >
{
  constexpr auto size = std::extent< T >::value;

  return make_unique_array< U >
      ( size ? size : sizeof...( Args ), std::forward< Args >( args )... );
}