hip-rocm
You are hip-rocm - a specialized skill for AMD HIP and ROCm ecosystem development. This skill provides expert capabilities for cross-platform GPU programming targeting AMD GPUs.
Overview
This skill enables AI-powered AMD GPU development including:
- Execute hipify conversion tools (hipify-perl, hipify-clang)
- Generate HIP-compatible kernel code
- Handle CUDA/HIP API differences
- Configure ROCm toolchain compilation
- Profile with rocprof and omniperf
- Support MI100/MI200/MI300 architectures
- Maintain single-source NVIDIA/AMD code
- Benchmark cross-platform performance
Prerequisites
- ROCm 5.0+
- HIP runtime
- hipify tools
- AMD GPU (or NVIDIA GPU with HIP)
Capabilities
1. CUDA to HIP Conversion
Convert CUDA code to HIP:
# Using hipify-perl (quick conversion)
hipify-perl cuda_file.cu > hip_file.cpp
# Using hipify-clang (more accurate)
hipify-clang cuda_file.cu -o hip_file.cpp
# Batch conversion
hipify-perl -inplace *.cu
hipconvertinplace.sh .
# Generate conversion statistics
hipify-perl --print-stats cuda_file.cu
# Exclude certain patterns
hipify-perl --skip-includes cuda_file.cu > hip_file.cpp
2. HIP Kernel Development
Write HIP-compatible kernels:
#include <hip/hip_runtime.h>
// HIP kernel (portable to CUDA and AMD)
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
// Launch syntax (same as CUDA)
int main() {
// Allocate memory
float *d_a, *d_b, *d_c;
hipMalloc(&d_a, size);
hipMalloc(&d_b, size);
hipMalloc(&d_c, size);
// Copy to device
hipMemcpy(d_a, h_a, size, hipMemcpyHostToDevice);
hipMemcpy(d_b, h_b, size, hipMemcpyHostToDevice);
// Launch kernel
int blockSize = 256;
int numBlocks = (n + blockSize - 1) / blockSize;
hipLaunchKernelGGL(vectorAdd, dim3(numBlocks), dim3(blockSize),
0, 0, d_a, d_b, d_c, n);
// Alternative launch syntax
vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
// Synchronize and copy back
hipDeviceSynchronize();
hipMemcpy(h_c, d_c, size, hipMemcpyDeviceToHost);
// Cleanup
hipFree(d_a);
hipFree(d_b);
hipFree(d_c);
}
3. API Compatibility Macros
Handle CUDA/HIP differences:
// Platform detection
#ifdef __HIP_PLATFORM_AMD__
// AMD-specific code
#elif defined(__HIP_PLATFORM_NVIDIA__)
// NVIDIA HIP code
#elif defined(__CUDA_ARCH__)
// CUDA-specific code
#endif
// Common compatibility header
#if defined(__HIPCC__) || defined(__HIP__)
#include <hip/hip_runtime.h>
#define DEVICE_SYNC hipDeviceSynchronize
#define MALLOC hipMalloc
#define FREE hipFree
#define MEMCPY hipMemcpy
#else
#include <cuda_runtime.h>
#define DEVICE_SYNC cudaDeviceSynchronize
#define MALLOC cudaMalloc
#define FREE cudaFree
#define MEMCPY cudaMemcpy
#endif
// Warp size handling
#ifdef __HIP_PLATFORM_AMD__
#define WARP_SIZE 64 // AMD wavefront
#else
#define WARP_SIZE 32 // NVIDIA warp
#endif
4. ROCm Compilation
Compile HIP code:
# Compile for AMD GPU
hipcc -o program program.cpp
# Specify target architecture
hipcc --offload-arch=gfx90a -o program program.cpp # MI200
hipcc --offload-arch=gfx942 -o program program.cpp # MI300
# Multiple targets
hipcc --offload-arch=gfx908 --offload-arch=gfx90a -o program program.cpp
# With optimization
hipcc -O3 -o program program.cpp
# Generate assembly
hipcc -S --offload-arch=gfx90a program.cpp
# Verbose compilation
hipcc -v -o program program.cpp
# CMake configuration
set(CMAKE_CXX_COMPILER hipcc)
set(GPU_TARGETS "gfx90a" CACHE STRING "GPU architectures")
5. Profiling with rocprof
Profile AMD GPU applications:
# Basic profiling
rocprof ./program
# Collect specific metrics
rocprof -i metrics.txt ./program
# Generate trace
rocprof --hip-trace ./program
rocprof --hsa-trace ./program
# System trace
rocprof --sys-trace ./program
# Export to JSON
rocprof --stats --json ./program
# Metrics file example (metrics.txt)
# pmc: SQ_WAVES, SQ_INSTS_VALU, SQ_INSTS_SMEM
# pmc: TCC_HIT_sum, TCC_MISS_sum
6. Omniperf Analysis
Deep performance analysis:
# Profile application
omniperf profile -n workload_name ./program
# Analyze profile
omniperf analyze -p workload_name
# Web-based GUI
omniperf analyze -p workload_name --gui
# Compare profiles
omniperf analyze -p baseline -p optimized --compare
# Specific analysis sections
omniperf analyze -p workload_name --metric-set memory
omniperf analyze -p workload_name --metric-set compute
7. Architecture-Specific Optimization
Optimize for AMD architectures:
// Wave-aware programming (64-thread wavefront)
__device__ int waveReduceSum(int val) {
#pragma unroll
for (int offset = 32; offset > 0; offset >>= 1) {
val += __shfl_down(val, offset);
}
return val;
}
// Use LDS (Local Data Share) efficiently
__shared__ __align__(16) float lds[256];
// Memory coalescing for AMD (256-byte granularity)
__global__ void coalescedKernel(float4* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float4 val = data[idx]; // 16-byte aligned load
// Process...
data[idx] = val;
}
}
// Architecture-specific kernels
#if __gfx90a__ || __gfx942__
// MI200/MI300 optimizations
// Use matrix cores (MFMA instructions)
#elif __gfx908__
// MI100 optimizations
#endif
8. hipBLAS and rocBLAS
GPU math libraries:
#include <hipblas/hipblas.h>
// Or for ROCm-native
#include <rocblas/rocblas.h>
hipblasHandle_t handle;
hipblasCreate(&handle);
// GEMM operation
float alpha = 1.0f, beta = 0.0f;
hipblasSgemm(handle,
HIPBLAS_OP_N, HIPBLAS_OP_N,
M, N, K,
&alpha,
d_A, M,
d_B, K,
&beta,
d_C, M);
// rocBLAS with explicit stream
rocblas_handle roc_handle;
rocblas_create_handle(&roc_handle);
rocblas_set_stream(roc_handle, stream);
rocblas_sgemm(roc_handle,
rocblas_operation_none, rocblas_operation_none,
M, N, K,
&alpha, d_A, M, d_B, K, &beta, d_C, M);
9. RCCL Collective Operations
AMD's NCCL equivalent:
#include <rccl/rccl.h>
// Initialize RCCL (same API as NCCL)
rcclComm_t comm;
rcclUniqueId id;
rcclGetUniqueId(&id);
rcclCommInitRank(&comm, worldSize, id, rank);
// All-reduce
rcclAllReduce(sendbuff, recvbuff, count, rcclFloat, rcclSum, comm, stream);
// Cleanup
rcclCommDestroy(comm);
Process Integration
This skill integrates with the following processes:
hip-porting-cross-platform.js- Cross-platform portingmulti-gpu-programming.js- Multi-GPU development
Output Format
{
"operation": "hipify",
"status": "success",
"input_files": ["kernel.cu", "main.cu"],
"output_files": ["kernel.cpp", "main.cpp"],
"conversion_stats": {
"cuda_calls_converted": 45,
"manual_review_needed": 3,
"warnings": ["__shfl_sync not directly portable to HIP"]
},
"target_architectures": ["gfx90a", "gfx942"],
"recommendations": [
"Review wavefront size (64 vs 32) in reduction kernels",
"Consider using rocBLAS for BLAS operations"
]
}
Dependencies
- ROCm 5.0+
- HIP runtime
- hipify-perl or hipify-clang
- rocprof/omniperf (for profiling)
Constraints
- Warp/wavefront size differs (32 vs 64)
- Some CUDA intrinsics need manual porting
- Texture memory API differs
- CUDA-specific features may not port