Writing single-node HPX applications

HPX is a C++ Standard Library for Concurrency and Parallelism. This means that it implements all of the corresponding facilities as defined by the C++ Standard. Additionally, in HPX we implement functionalities proposed as part of the ongoing C++ standardization process. This section focuses on the features available in HPX for parallel and concurrent computation on a single node, although many of the features presented here are also implemented to work in the distributed case.

Using LCOs

Lightweight Control Objects provide synchronization for HPX applications. Most of them are familiar from other frameworks, but a few of them work in slightly special different ways adapted to HPX.

  1. future

  2. queue

  3. object_semaphore

  4. barrier

Channels

Channels combine communication (the exchange of a value) with synchronization (guaranteeing that two calculations (tasks) are in a known state). A channel can transport any number of values of a given type from a sender to a receiver:

hpx::lcos::local::channel<int> c;
c.set(42);
cout << c.get();      // will print '42'

Channels can be handed to another thread (or in case of channel components, to other localities), thus establishing a communication channel between two independent places in the program:

void do_something(
    hpx::lcos::local::receive_channel<int> c,
    hpx::lcos::local::send_channel<> done)
{
    cout << c.get();        // prints 42
    done.set();             // signal back
}

{
    hpx::lcos::local::channel<int> c;
    hpx::lcos::local::channel<> done;

    hpx::apply(&do_something, c, done);

    c.set(42);              // send some value
    done.get();             // wait for thread to be done
}

A channel component is created on one locality and can be send to another locality using an action. This example also demonstrates how a channel can be used as a range of values:

// channel components need to be registered for each used type (not needed
// for hpx::lcos::local::channel)
HPX_REGISTER_CHANNEL(double);

void some_action(hpx::lcos::channel<double> c)
{
    for (double d : c)
        hpx::cout << d << std::endl;
}
HPX_REGISTER_ACTION(some_action);

{
    // create the channel on this locality
    hpx::lcos::channel<double> c(hpx::find_here());

    // pass the channel to a (possibly remote invoked) action
    hpx::apply(some_action(), hpx::find_here(), c);

    // send some values to the receiver
    std::vector<double> v = { 1.2, 3.4, 5.0 };
    for (double d : v)
        c.set(d);

    // explicitly close the communication channel (implicit at destruction)
    c.close();
}

Composable guards

Composable guards operate in a manner similar to locks, but are applied only to asynchronous functions. The guard (or guards) is automatically locked at the beginning of a specified task and automatically unlocked at the end. Because guards are never added to an existing task’s execution context, the calling of guards is freely composable and can never deadlock.

To call an application with a single guard, simply declare the guard and call run_guarded() with a function (task):

hpx::lcos::local::guard gu;
run_guarded(gu,task);

If a single method needs to run with multiple guards, use a guard set:

boost::shared<hpx::lcos::local::guard> gu1(new hpx::lcos::local::guard());
boost::shared<hpx::lcos::local::guard> gu2(new hpx::lcos::local::guard());
gs.add(*gu1);
gs.add(*gu2);
run_guarded(gs,task);

Guards use two atomic operations (which are not called repeatedly) to manage what they do, so overhead should be extremely low.

  1. conditional_trigger

  2. counting_semaphore

  3. dataflow

  4. event

  5. mutex

  6. once

  7. recursive_mutex

  8. spinlock

  9. spinlock_no_backoff

  10. trigger

Extended facilities for futures

Concurrency is about both decomposing and composing the program from the parts that work well individually and together. It is in the composition of connected and multicore components where today’s C++ libraries are still lacking.

The functionality of std::future offers a partial solution. It allows for the separation of the initiation of an operation and the act of waiting for its result; however the act of waiting is synchronous. In communication-intensive code this act of waiting can be unpredictable, inefficient and simply frustrating. The example below illustrates a possible synchronous wait using futures:

#include <future>
using namespace std;
int main()
{
    future<int> f = async([]() { return 123; });
    int result = f.get(); // might block
}

For this reason, HPX implements a set of extensions to std::future (as proposed by __cpp11_n4107__). This proposal introduces the following key asynchronous operations to hpx::future, hpx::shared_future and hpx::async, which enhance and enrich these facilities.

Table 23 Facilities extending std::future

Facility

Description

hpx::future::then

In asynchronous programming, it is very common for one asynchronous operation, on completion, to invoke a second operation and pass data to it. The current C++ standard does not allow one to register a continuation to a future. With then instead of waiting for the result, a continuation is “attached” to the asynchronous operation, which is invoked when the result is ready. Continuations registered using then function will help to avoid blocking waits or wasting threads on polling, greatly improving the responsiveness and scalability of an application.

unwrapping constructor for hpx::future

In some scenarios, you might want to create a future that returns another future, resulting in nested futures. Although it is possible to write code to unwrap the outer future and retrieve the nested future and its result, such code is not easy to write because you must handle exceptions and it may cause a blocking call. Unwrapping can allow us to mitigate this problem by doing an asynchronous call to unwrap the outermost future.

hpx::future::is_ready

There are often situations where a get() call on a future may not be a blocking call, or is only a blocking call under certain circumstances. This function gives the ability to test for early completion and allows us to avoid associating a continuation, which needs to be scheduled with some non-trivial overhead and near-certain loss of cache efficiency.

hpx::make_ready_future

Some functions may know the value at the point of construction. In these cases the value is immediately available, but needs to be returned as a future. By using hpx::make_ready_future a future can be created which holds a pre-computed result in its shared state. In the current standard it is non-trivial to create a future directly from a value. First a promise must be created, then the promise is set, and lastly the future is retrieved from the promise. This can now be done with one operation.

The standard also omits the ability to compose multiple futures. This is a common pattern that is ubiquitous in other asynchronous frameworks and is absolutely necessary in order to make C++ a powerful asynchronous programming language. Not including these functions is synonymous to Boolean algebra without AND/OR.

In addition to the extensions proposed by N4313, HPX adds functions allowing to compose several futures in a more flexible way.

Table 24 Facilities for composing hpx::futures

Facility

Description

Comment

hpx::when_any, hpx::when_any_n

Asynchronously wait for at least one of multiple future or shared_future objects to finish.

N4313, ..._n versions are HPX only

hpx::wait_any, hpx::wait_any_n

Synchronously wait for at least one of multiple future or shared_future objects to finish.

HPX only

hpx::when_all, hpx::when_all_n

Asynchronously wait for all future and shared_future objects to finish.

N4313, ..._n versions are HPX only

hpx::wait_all, hpx::wait_all_n

Synchronously wait for all future and shared_future objects to finish.

HPX only

hpx::when_some, hpx::when_some_n

Asynchronously wait for multiple future and shared_future objects to finish.

HPX only

hpx::wait_some, hpx::wait_some_n

Synchronously wait for multiple future and shared_future objects to finish.

HPX only

hpx::when_each

Asynchronously wait for multiple future and shared_future objects to finish and call a function for each of the future objects as soon as it becomes ready.

HPX only

hpx::wait_each, hpx::wait_each_n

Synchronously wait for multiple future and shared_future objects to finish and call a function for each of the future objects as soon as it becomes ready.

HPX only

High level parallel facilities

In preparation for the upcoming C++ Standards we currently see several proposals targeting different facilities supporting parallel programming. HPX implements (and extends) some of those proposals. This is well aligned with our strategy to align the APIs exposed from HPX with current and future C++ Standards.

At this point, HPX implements several of the C++ Standardization working papers, most notably N4409 (Working Draft, Technical Specification for C++ Extensions for Parallelism), N4411 (Task Blocks), and N4406 (Parallel Algorithms Need Executors).

Using parallel algorithms

A parallel algorithm is a function template described by this document which is declared in the (inline) namespace hpx::parallel::v1.

Note

For compilers which do not support inline namespaces, all of the namespace v1 is imported into the namespace hpx::parallel. The effect is similar to what inline namespaces would do, namely all names defined in hpx::parallel::v1 are accessible from the namespace hpx::parallel as well.

All parallel algorithms are very similar in semantics to their sequential counterparts (as defined in the namespace std) with an additional formal template parameter named ExecutionPolicy. The execution policy is generally passed as the first argument to any of the parallel algorithms and describes the manner in which the execution of these algorithms may be parallelized and the manner in which they apply user-provided function objects.

The applications of function objects in parallel algorithms invoked with an execution policy object of type hpx::parallel::execution::sequenced_policy or hpx::parallel::execution::sequenced_task_policy execute in sequential order. For hpx::parallel::execution::sequenced_policy the execution happens in the calling thread.

The applications of function objects in parallel algorithms invoked with an execution policy object of type hpx::parallel::execution::parallel_policy or hpx::parallel::execution::parallel_task_policy are permitted to execute in an unordered fashion in unspecified threads, and indeterminately sequenced within each thread.

Important

It is the caller’s responsibility to ensure correctness, for example that the invocation does not introduce data races or deadlocks.

The applications of function objects in parallel algorithms invoked with an execution policy of type hpx::parallel::execution::parallel_unsequenced_policy is in HPX equivalent to the use of the execution policy hpx::parallel::execution::parallel_policy.

Algorithms invoked with an execution policy object of type hpx::parallel::v1::execution_policy execute internally as if invoked with the contained execution policy object. No exception is thrown when an hpx::parallel::v1::execution_policy contains an execution policy of type hpx::parallel::execution::sequenced_task_policy or hpx::parallel::execution::parallel_task_policy (which normally turn the algorithm into its asynchronous version). In this case the execution is semantically equivalent to the case of passing a hpx::parallel::execution::sequenced_policy or hpx::parallel::execution::parallel_policy contained in the hpx::parallel::v1::execution_policy object respectively.

Parallel exceptions

During the execution of a standard parallel algorithm, if temporary memory resources are required by any of the algorithms and no memory is available, the algorithm throws a std::bad_alloc exception.

During the execution of any of the parallel algorithms, if the application of a function object terminates with an uncaught exception, the behavior of the program is determined by the type of execution policy used to invoke the algorithm:

For example, the number of invocations of the user-provided function object in for_each is unspecified. When hpx::parallel::v1::for_each is executed sequentially, only one exception will be contained in the hpx::exception_list object.

These guarantees imply that, unless the algorithm has failed to allocate memory and terminated with std::bad_alloc, all exceptions thrown during the execution of the algorithm are communicated to the caller. It is unspecified whether an algorithm implementation will “forge ahead” after encountering and capturing a user exception.

The algorithm may terminate with the std::bad_alloc exception even if one or more user-provided function objects have terminated with an exception. For example, this can happen when an algorithm fails to allocate memory while creating or adding elements to the hpx::exception_list object.

Parallel algorithms

HPX provides implementations of the following parallel algorithms:

Table 25 Non-modifying parallel algorithms (in header: <hpx/include/parallel_algorithm.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::adjacent_find

Computes the differences between adjacent elements in a range.

<hpx/include/parallel_adjacent_find.hpp>

adjacent_find

hpx::parallel::v1::all_of

Checks if a predicate is true for all of the elements in a range.

<hpx/include/parallel_all_any_none.hpp>

all_any_none_of

hpx::parallel::v1::any_of

Checks if a predicate is true for any of the elements in a range.

<hpx/include/parallel_all_any_none.hpp>

all_any_none_of

hpx::parallel::v1::count

Returns the number of elements equal to a given value.

<hpx/include/parallel_count.hpp>

count

hpx::parallel::v1::count_if

Returns the number of elements satisfying a specific criteria.

<hpx/include/parallel_count.hpp>

count_if

hpx::parallel::v1::equal

Determines if two sets of elements are the same.

<hpx/include/parallel_equal.hpp>

equal

hpx::parallel::v1::exclusive_scan

Does an exclusive parallel scan over a range of elements.

<hpx/include/parallel_scan.hpp>

exclusive_scan

hpx::parallel::v1::find

Finds the first element equal to a given value.

<hpx/include/parallel_find.hpp>

find

hpx::parallel::v1::find_end

Finds the last sequence of elements in a certain range.

<hpx/include/parallel_find.hpp>

find_end

hpx::parallel::v1::find_first_of

Searches for any one of a set of elements.

<hpx/include/parallel_find.hpp>

find_first_of

hpx::parallel::v1::find_if

Finds the first element satisfying a specific criteria.

<hpx/include/parallel_find.hpp>

find

hpx::parallel::v1::find_if_not

Finds the first element not satisfying a specific criteria.

<hpx/include/parallel_find.hpp>

find_if_not

hpx::parallel::v1::for_each

Applies a function to a range of elements.

<hpx/include/parallel_for_each.hpp>

for_each

hpx::parallel::v1::for_each_n

Applies a function to a number of elements.

<hpx/include/parallel_for_each.hpp>

for_each_n

hpx::parallel::v1::inclusive_scan

Does an inclusive parallel scan over a range of elements.

<hpx/include/parallel_scan.hpp>

inclusive_scan

hpx::parallel::v1::lexicographical_compare

Checks if a range of values is lexicographically less than another range of values.

<hpx/include/parallel_lexicographical_compare.hpp>

lexicographical_compare

hpx::parallel::v1::mismatch

Finds the first position where two ranges differ.

<hpx/include/parallel_mismatch.hpp>

mismatch

hpx::parallel::v1::none_of

Checks if a predicate is true for none of the elements in a range.

<hpx/include/parallel_all_any_none.hpp>

all_any_none_of

hpx::parallel::v1::search

Searches for a range of elements.

<hpx/include/parallel_search.hpp>

search

hpx::parallel::v1::search_n

Searches for a number consecutive copies of an element in a range.

<hpx/include/parallel_search.hpp>

search_n

Table 26 Modifying Parallel Algorithms (In Header: <hpx/include/parallel_algorithm.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::copy

Copies a range of elements to a new location.

<hpx/include/parallel_copy.hpp>

exclusive_scan

hpx::parallel::v1::copy_n

Copies a number of elements to a new location.

<hpx/include/parallel_copy.hpp>

copy_n

hpx::parallel::v1::copy_if

Copies the elements from a range to a new location for which the given predicate is true

<hpx/include/parallel_copy.hpp>

copy

hpx::parallel::v1::move

Moves a range of elements to a new location.

<hpx/include/parallel_fill.hpp>

move

hpx::parallel::v1::fill

Assigns a range of elements a certain value.

<hpx/include/parallel_fill.hpp>

fill

hpx::parallel::v1::fill_n

Assigns a value to a number of elements.

<hpx/include/parallel_fill.hpp>

fill_n

hpx::parallel::v1::generate

Saves the result of a function in a range.

<hpx/include/parallel_generate.hpp>

generate

hpx::parallel::v1::generate_n

Saves the result of N applications of a function.

<hpx/include/parallel_generate.hpp>

generate_n

hpx::parallel::v1::remove

Removes the elements from a range that are equal to the given value.

<hpx/include/parallel_remove.hpp>

remove

hpx::parallel::v1::remove_if

Removes the elements from a range that are equal to the given predicate is false

<hpx/include/parallel_remove.hpp>

remove

hpx::parallel::v1::remove_copy

Copies the elements from a range to a new location that are not equal to the given value.

<hpx/include/parallel_remove_copy.hpp>

remove_copy

hpx::parallel::v1::remove_copy_if

Copies the elements from a range to a new location for which the given predicate is false

<hpx/include/parallel_remove_copy.hpp>

remove_copy

hpx::parallel::v1::replace

Replaces all values satisfying specific criteria with another value.

<hpx/include/parallel_replace.hpp>

replace

hpx::parallel::v1::replace_if

Replaces all values satisfying specific criteria with another value.

<hpx/include/parallel_replace.hpp>

replace

hpx::parallel::v1::replace_copy

Copies a range, replacing elements satisfying specific criteria with another value.

<hpx/include/parallel_replace.hpp>

replace_copy

hpx::parallel::v1::replace_copy_if

Copies a range, replacing elements satisfying specific criteria with another value.

<hpx/include/parallel_replace.hpp>

replace_copy

hpx::parallel::v1::reverse

Reverses the order elements in a range.

<hpx/include/parallel_reverse.hpp>

reverse

hpx::parallel::v1::reverse_copy

Creates a copy of a range that is reversed.

<hpx/include/parallel_reverse.hpp>

reverse_copy

hpx::parallel::v1::rotate

Rotates the order of elements in a range.

<hpx/include/parallel_rotate.hpp>

rotate

hpx::parallel::v1::rotate_copy

Copies and rotates a range of elements.

<hpx/include/parallel_rotate.hpp>

rotate_copy

hpx::parallel::v1::swap_ranges

Swaps two ranges of elements.

<hpx/include/parallel_swap_ranges.hpp>

swap_ranges

hpx::parallel::v1::transform

Applies a function to a range of elements.

<hpx/include/parallel_transform.hpp>

transform

hpx::parallel::v1::unique_copy

Eliminates all but the first element from every consecutive group of equivalent elements from a range.

<hpx/include/parallel_unique.hpp>

unique

hpx::parallel::v1::unique_copy

Eliminates all but the first element from every consecutive group of equivalent elements from a range.

<hpx/include/parallel_unique.hpp>

unique_copy

Table 27 Set operations on sorted sequences (In Header: <hpx/include/parallel_algorithm.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::merge

Merges two sorted ranges.

<hpx/include/parallel_merge.hpp>

merge

hpx::parallel::v1::inplace_merge

Merges two ordered ranges in-place.

<hpx/include/parallel_merge.hpp>

inplace_merge

hpx::parallel::v1::includes

Returns true if one set is a subset of another.

<hpx/include/parallel_set_operations.hpp>

includes

hpx::parallel::v1::set_difference

Computes the difference between two sets.

<hpx/include/parallel_set_operations.hpp>

set_difference

hpx::parallel::v1::set_intersection

Computes the intersection of two sets.

<hpx/include/parallel_set_operations.hpp>

set_intersection

hpx::parallel::v1::set_symmetric_difference

Computes the symmetric difference between two sets.

<hpx/include/parallel_set_operations.hpp>

set_symmetric_difference

hpx::parallel::v1::set_union

Computes the union of two sets.

<hpx/include/parallel_set_operations.hpp>

set_union

Table 28 Heap operations (In Header: <hpx/include/parallel_algorithm.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::is_heap

Returns true if the range is max heap.

<hpx/include/is_heap.hpp>

is_heap

hpx::parallel::v1::is_heap_until

Returns the first element that breaks a max heap.

<hpx/include/is_heap.hpp>

is_heap_until

Table 29 Minimum/maximum operations (In Header: <hpx/include/parallel_algortithm.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::max_element

Returns the largest element in a range.

<hpx/include/parallel_minmax.hpp>

max_element

hpx::parallel::v1::min_element

Returns the smallest element in a range.

<hpx/include/parallel_minmax.hpp>

min_element

hpx::parallel::v1::minmax_element

Returns the smallest and the largest element in a range.

<hpx/include/parallel_minmax.hpp>

minmax_element

Table 30 Partitioning Operations (In Header: <hpx/include/parallel_algorithm.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::is_partitioned

Returns true if each true element for a predicate precedes the false elements in a range

<hpx/include/parallel_is_partitioned.hpp>

is_partitioned

hpx::parallel::v1::partition

Divides elements into two groups while don’t preserve their relative order

<hpx/include/parallel_partition.hpp>

partition

hpx::parallel::v1::partition_copy

Copies a range dividing the elements into two groups

<hpx/include/parallel_partition.hpp>

partition_copy

hpx::parallel::v1::stable_partition

Divides elements into two groups while preserving their relative order

<hpx/include/parallel_partition.hpp>

stable_partition

Table 31 Sorting Operations (In Header: <hpx/include/parallel_algorithm.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::is_sorted

Returns true if each element in a range is sorted

<hpx/include/parallel_is_sorted.hpp>

is_sorted

hpx::parallel::v1::is_sorted_until

Returns the first unsorted element

<hpx/include/parallel_is_sorted.hpp>

is_sorted_until

hpx::parallel::v1::sort

Sorts the elements in a range

<hpx/include/parallel_sort.hpp>

sort

hpx::parallel::v1::sort_by_key

Sorts one range of data using keys supplied in another range

<hpx/include/parallel_sort.hpp>

Table 32 Numeric Parallel Algorithms (In Header: <hpx/include/parallel_numeric.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::adjacent_difference

Calculates the difference between each element in an input range and the preceding element.

<hpx/include/parallel_adjacent_difference.hpp>

adjacent_difference

hpx::parallel::v1::reduce

Sums up a range of elements.

<hpx/include/parallel_reduce.hpp>

reduce

hpx::parallel::v1::reduce_by_key

Performs an inclusive scan on consecutive elements with matching keys, with a reduction to output only the final sum for each key. The key sequence {1,1,1,2,3,3,3,3,1} and value sequence {2,3,4,5,6,7,8,9,10} would be reduced to keys={1,2,3,1}, values={9,5,30,10}

<hpx/include/parallel_reduce.hpp>

hpx::parallel::v1::transform_reduce

Sums up a range of elements after applying a function. Also, accumulates the inner products of two input ranges.

<hpx/include/parallel_transform_reduce.hpp>

transform_reduce

hpx::parallel::v1::transform_inclusive_scan

Does an inclusive parallel scan over a range of elements after applying a function.

<hpx/include/parallel_scan.hpp>

transform_inclusive_scan

hpx::parallel::v1::transform_exclusive_scan

Does an exclusive parallel scan over a range of elements after applying a function.

<hpx/include/parallel_scan.hpp>

transform_exclusive_scan

Table 33 Dynamic Memory Management (In Header: <hpx/include/parallel_memory.hpp>)

Name

Description

In header

Algorithm page at cppreference.com

hpx::parallel::v1::destroy

Destroys a range of objects.

<hpx/include/parallel_destroy.hpp>

destroy

hpx::parallel::v1::destroy_n

Destroys a range of objects.

<hpx/include/parallel_destroy.hpp>

destroy_n

hpx::parallel::v1::uninitialized_copy

Copies a range of objects to an uninitialized area of memory.

<hpx/include/parallel_uninitialized_copy.hpp>

uninitialized_copy

hpx::parallel::v1::uninitialized_copy_n

Copies a number of objects to an uninitialized area of memory.

<hpx/include/parallel_uninitialized_copy.hpp>

uninitialized_copy_n

hpx::parallel::v1::uninitialized_default_construct

Copies a range of objects to an uninitialized area of memory.

<hpx/include/parallel_uninitialized_default_construct.hpp>

uninitialized_default_construct

hpx::parallel::v1::uninitialized_default_construct_n

Copies a number of objects to an uninitialized area of memory.

<hpx/include/parallel_uninitialized_default_construct.hpp>

uninitialized_default_construct_n

hpx::parallel::v1::uninitialized_fill

Copies an object to an uninitialized area of memory.

<hpx/include/parallel_uninitialized_fill.hpp>

uninitialized_fill

hpx::parallel::v1::uninitialized_fill_n

Copies an object to an uninitialized area of memory.

<hpx/include/parallel_uninitialized_fill.hpp>

uninitialized_fill_n

hpx::parallel::v1::uninitialized_move

Moves a range of objects to an uninitialized area of memory.

<hpx/include/parallel_uninitialized_move.hpp>

uninitialized_move

hpx::parallel::v1::uninitialized_move_n

Moves a number of objects to an uninitialized area of memory.

<hpx/include/parallel_uninitialized_move.hpp>

uninitialized_move_n

hpx::parallel::v1::uninitialized_value_construct

Constructs objects in an uninitialized area of memory.

<hpx/include/parallel_uninitialized_value_construct.hpp>

uninitialized_value_construct

hpx::parallel::v1::uninitialized_value_construct_n

Constructs objects in an uninitialized area of memory.

<hpx/include/uninitialized_value_construct.hpp>

uninitialized_value_construct_n

Table 34 Index-based for-loops (In Header: <hpx/include/parallel_algorithm.hpp>)

Name

Description

In header

hpx::parallel::v2::for_loop

Implements loop functionality over a range specified by integral or iterator bounds.

<hpx/include/parallel_for_loop.hpp>

hpx::parallel::v2::for_loop_strided

Implements loop functionality over a range specified by integral or iterator bounds.

<hpx/include/parallel_for_loop.hpp>

hpx::parallel::v2::for_loop_n

Implements loop functionality over a range specified by integral or iterator bounds.

<hpx/include/parallel_for_loop.hpp>

hpx::parallel::v2::for_loop_n_strided

Implements loop functionality over a range specified by integral or iterator bounds.

<hpx/include/parallel_for_loop.hpp>

Executor parameters and executor parameter traits

In HPX we introduce the notion of execution parameters and execution parameter traits. At this point, the only parameter which can be customized is the size of the chunks of work executed on a single HPX-thread (such as the number of loop iterations combined to run as a single task).

An executor parameter object is responsible for exposing the calculation of the size of the chunks scheduled. It abstracts the (potential platform-specific) algorithms of determining those chunks sizes.

The way executor parameters are implemented is aligned with the way executors are implemented. All functionalities of concrete executor parameter types are exposed and accessible through a corresponding hpx::parallel::executor_parameter_traits type.

With executor_parameter_traits clients access all types of executor parameters uniformly:

std::size_t chunk_size =
    executor_parameter_traits<my_parameter_t>::get_chunk_size(my_parameter,
        my_executor, [](){ return 0; }, num_tasks);

This call synchronously retrieves the size of a single chunk of loop iterations (or similar) to combine for execution on a single HPX-thread if the overall number of tasks to schedule is given by num_tasks. The lambda function exposes a means of test-probing the execution of a single iteration for performance measurement purposes (the execution parameter type might dynamically determine the execution time of one or more tasks in order to calculate the chunk size, see hpx::parallel::execution::auto_chunk_size for an example of such a executor parameter type).

Other functions in the interface exist to discover whether a executor parameter type should be invoked once (i.e. returns a static chunk size, see hpx::parallel::execution::static_chunk_size) or whether it should be invoked for each scheduled chunk of work (i.e. it returns a variable chunk size, for an example, see hpx::parallel::execution::guided_chunk_size).

Though this interface appears to require executor parameter type authors to implement all different basic operations, there is really none required. In practice, all operations have sensible defaults. However, some executor parameter types will naturally specialize all operations for maximum efficiency.

In HPX we have implemented the following executor parameter types:

  • hpx::parallel::execution::auto_chunk_size: Loop iterations are divided into pieces and then assigned to threads. The number of loop iterations combined is determined based on measurements of how long the execution of 1% of the overall number of iterations takes. This executor parameters type makes sure that as many loop iterations are combined as necessary to run for the amount of time specified.

  • hpx::parallel::execution::static_chunk_size: Loop iterations are divided into pieces of a given size and then assigned to threads. If the size is not specified, the iterations are evenly (if possible) divided contiguously among the threads. This executor parameters type is equivalent to OpenMP’s STATIC scheduling directive.

  • hpx::parallel::execution::dynamic_chunk_size: Loop iterations are divided into pieces of a given size and then dynamically scheduled among the cores; when a core finishes one chunk, it is dynamically assigned another. If the size is not specified, the default chunk size is 1. This executor parameters type is equivalent to OpenMP’s DYNAMIC scheduling directive.

  • hpx::parallel::execution::guided_chunk_size: Iterations are dynamically assigned to cores in blocks as cores request them until no blocks remain to be assigned. Similar to dynamic_chunk_size except that the block size decreases each time a number of loop iterations is given to a thread. The size of the initial block is proportional to number_of_iterations / number_of_cores. Subsequent blocks are proportional to number_of_iterations_remaining / number_of_cores. The optional chunk size parameter defines the minimum block size. The default minimal chunk size is 1. This executor parameters type is equivalent to OpenMP’s GUIDED scheduling directive.

Using task blocks

The define_task_block, run and the wait functions implemented based on N4411 are based on the task_block concept that is a part of the common subset of the Microsoft Parallel Patterns Library (PPL) and the Intel Threading Building Blocks (TBB) libraries.

These implementations adopt a simpler syntax than exposed by those libraries— one that is influenced by language-based concepts such as spawn and sync from Cilk++ and async and finish from X10. It improves on existing practice in the following ways:

  • The exception handling model is simplified and more consistent with normal C++ exceptions.

  • Most violations of strict fork-join parallelism can be enforced at compile time (with compiler assistance, in some cases).

  • The syntax allows scheduling approaches other than child stealing.

Consider an example of a parallel traversal of a tree, where a user-provided function compute is applied to each node of the tree, returning the sum of the results:

template <typename Func>
int traverse(node& n, Func && compute)
{
    int left = 0, right = 0;
    define_task_block(
        [&](task_block<>& tr) {
            if (n.left)
                tr.run([&] { left = traverse(*n.left, compute); });
            if (n.right)
                tr.run([&] { right = traverse(*n.right, compute); });
        });

    return compute(n) + left + right;
}

The example above demonstrates the use of two of the functions, hpx::parallel::define_task_block and the hpx::parallel::task_block::run member function of a hpx::parallel::task_block.

The task_block function delineates a region in a program code potentially containing invocations of threads spawned by the run member function of the task_block class. The run function spawns an HPX thread, a unit of work that is allowed to execute in parallel with respect to the caller. Any parallel tasks spawned by run within the task block are joined back to a single thread of execution at the end of the define_task_block. run takes a user-provided function object f and starts it asynchronously—i.e. it may return before the execution of f completes. The HPX scheduler may choose to run f immediately or delay running f until compute resources become available.

A task_block can be constructed only by define_task_block because it has no public constructors. Thus, run can be invoked (directly or indirectly) only from a user-provided function passed to define_task_block:

void g();

void f(task_block<>& tr)
{
    tr.run(g);          // OK, invoked from within task_block in h
}

void h()
{
    define_task_block(f);
}

int main()
{
    task_block<> tr;    // Error: no public constructor
    tr.run(g);          // No way to call run outside of a define_task_block
    return 0;
}

Extensions for task blocks

Using execution policies with task blocks

In HPX we implemented some extensions for task_block beyond the actual standards proposal N4411. The main addition is that a task_block can be invoked with a execution policy as its first argument, very similar to the parallel algorithms.

An execution policy is an object that expresses the requirements on the ordering of functions invoked as a consequence of the invocation of a task block. Enabling passing an execution policy to define_task_block gives the user control over the amount of parallelism employed by the created task_block. In the following example the use of an explicit par execution policy makes the user’s intent explicit:

template <typename Func>
int traverse(node *n, Func&& compute)
{
    int left = 0, right = 0;

    define_task_block(
        execution::par,                // execution::parallel_policy
        [&](task_block<>& tb) {
            if (n->left)
                tb.run([&] { left = traverse(n->left, compute); });
            if (n->right)
                tb.run([&] { right = traverse(n->right, compute); });
        });

    return compute(n) + left + right;
}

This also causes the hpx::parallel::v2::task_block object to be a template in our implementation. The template argument is the type of the execution policy used to create the task block. The template argument defaults to hpx::parallel::execution::parallel_policy.

HPX still supports calling hpx::parallel::v2::define_task_block without an explicit execution policy. In this case the task block will run using the hpx::parallel::execution::parallel_policy.

HPX also adds the ability to access the execution policy which was used to create a given task_block.

Using executors to run tasks

Often, we want to be able to not only define an execution policy to use by default for all spawned tasks inside the task block, but also to customize the execution context for one of the tasks executed by task_block::run. Adding an optionally passed executor instance to that function enables this use case:

template <typename Func>
int traverse(node *n, Func&& compute)
{
    int left = 0, right = 0;

    define_task_block(
        execution::par,                // execution::parallel_policy
        [&](auto& tb) {
            if (n->left)
            {
                // use explicitly specified executor to run this task
                tb.run(my_executor(), [&] { left = traverse(n->left, compute); });
            }
            if (n->right)
            {
                // use the executor associated with the par execution policy
                tb.run([&] { right = traverse(n->right, compute); });
            }
        });

    return compute(n) + left + right;
}

HPX still supports calling hpx::parallel::v2::task_block::run without an explicit executor object. In this case the task will be run using the executor associated with the execution policy which was used to call hpx::parallel::v2::define_task_block.