做h的游戲 迅雷下載網(wǎng)站線上推廣100種方式
接上篇文章,可以發(fā)現(xiàn)使用CUDA提供的API進行前綴和掃描時,第一次運行的時間不如共享內(nèi)存訪問,猜測是使用到了全局內(nèi)存。
首先看調(diào)用邏輯:
thrust::inclusive_scan(thrust::device, d_x, d_x + N, d_x);
第一個參數(shù)指定了設(shè)備,根據(jù)實參數(shù)量和類型找到對應(yīng)的函數(shù),是scan.h中的如下函數(shù):
template <typename DerivedPolicy, typename InputIterator, typename OutputIterator>
_CCCL_HOST_DEVICE OutputIterator inclusive_scan(const thrust::detail::execution_policy_base<DerivedPolicy>& exec,InputIterator first,InputIterator last,OutputIterator result);
其實現(xiàn)位于thrust\thrust\system\cuda\detail\scan.h
注意:路徑可能與實際有偏差,可以在/usr/local/下使用find . -name xx
查找對應(yīng)的文件
template <typename Derived, typename InputIt, typename OutputIt>
_CCCL_HOST_DEVICE OutputIt
inclusive_scan(thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, InputIt last, OutputIt result)
{return thrust::cuda_cub::inclusive_scan(policy, first, last, result, thrust::plus<>{});
}
將操作指定為plus,
然后執(zhí)行同一文件下的此函數(shù):
template <typename Derived, typename InputIt, typename OutputIt, typename ScanOp>
_CCCL_HOST_DEVICE OutputIt inclusive_scan(thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, InputIt last, OutputIt result, ScanOp scan_op)
{using diff_t = typename thrust::iterator_traits<InputIt>::difference_type;diff_t const num_items = thrust::distance(first, last);return thrust::cuda_cub::inclusive_scan_n(policy, first, num_items, result, scan_op);
}
最終找到主要的執(zhí)行邏輯:
_CCCL_EXEC_CHECK_DISABLE
template <typename Derived, typename InputIt, typename Size, typename OutputIt, typename ScanOp>
_CCCL_HOST_DEVICE OutputIt inclusive_scan_n_impl(thrust::cuda_cub::execution_policy<Derived>& policy, InputIt first, Size num_items, OutputIt result, ScanOp scan_op)
{using AccumT = typename thrust::iterator_traits<InputIt>::value_type;using Dispatch32 = cub::DispatchScan<InputIt, OutputIt, ScanOp, cub::NullType, std::int32_t, AccumT>;using Dispatch64 = cub::DispatchScan<InputIt, OutputIt, ScanOp, cub::NullType, std::int64_t, AccumT>;cudaStream_t stream = thrust::cuda_cub::stream(policy);cudaError_t status;// Determine temporary storage requirements:size_t tmp_size = 0;{THRUST_INDEX_TYPE_DISPATCH2(status,Dispatch32::Dispatch,Dispatch64::Dispatch,num_items,(nullptr, tmp_size, first, result, scan_op, cub::NullType{}, num_items_fixed, stream));thrust::cuda_cub::throw_on_error(status,"after determining tmp storage ""requirements for inclusive_scan");}// Run scan:{// Allocate temporary storage:thrust::detail::temporary_array<std::uint8_t, Derived> tmp{policy, tmp_size};THRUST_INDEX_TYPE_DISPATCH2(status,Dispatch32::Dispatch,Dispatch64::Dispatch,num_items,(tmp.data().get(), tmp_size, first, result, scan_op, cub::NullType{}, num_items_fixed, stream));thrust::cuda_cub::throw_on_error(status, "after dispatching inclusive_scan kernel");thrust::cuda_cub::throw_on_error(thrust::cuda_cub::synchronize_optional(policy), "inclusive_scan failed to synchronize");}return result + num_items;
}
可以看到,此處thrust調(diào)用了cub的Dispatchscan操作,而cub中是使用全局內(nèi)存的,因此造成了效率還不如手動編寫使用共享內(nèi)存的算法。