Exploiting intra-warp address monotonicity for fast memory coalescing in GPUs