Developing a Software Renderer Part 3

In this post I describe how to add pixel shader capabilities to the software rasterizer and how to optimize it even further for example using OpenMP to parallelize the rasterization.

Pixel Shader Support

Currently the per pixel operations are coded directly inside the innermost loop. We want to have the ability to do arbitrary work per pixel and use the rasterizer as a library.

One simple way to do this would be to define an interface IPixelShader with a virtual method drawPixel and pass this to the rasterizer. I already tried this out and found out that the virtual function call is way to expensive and slows everything down too much. The second best option would be a function pointer but I assumed this will be slow too.

The best thing would be if the compiler could inline our per pixel code, so that there is no function call at all. It turns out there is such a possibility with C++ templates.

We can parameterize the whole triangle rasterization method with our pixel shader class using a c++ template and call a static method in the pixel shader class to draw our pixels. This way the compiler can inline the pixel drawing method inside the inner loop and this should give the best performance.

The pixel shader class will then look like this:

class PixelShader : public PixelShaderBase<PixelShader> {
public:
  static const bool InterpolateZ = false;
  static const bool InterpolateW = false;
  static const int VarCount = 3;

  static SDL_Surface* surface;

  static void drawPixel(const PixelData &p)
  {
    int rint = (int)(p.var[0] * 255);
    int gint = (int)(p.var[1] * 255);
    int bint = (int)(p.var[2] * 255);

    Uint32 color = rint << 16 | gint << 8 | bint;

    Uint32 *buffer = (Uint32*)((Uint8 *)surface->pixels + (int)p.y * surface->pitch + (int)p.x * 4);
    *buffer = color;
  }
};

The pixel shader can tell the rasterizer if the z and w component are supposed to be interpolated and it can also tell the rasterizer how many per vertex attributes to interpolate (VarCount). Our updated Vertex structure now supports arbitrary variables not just RGB color.

struct Vertex {
  float x;
  float y;
  float z;
  float w;

  float var[MaxVar];
};

To use the pixel shader we can use the following code:

rasterizer.setPixelShader<PixelShader>();
rasterizer.drawTriangle(v0, v1, v2);

The trick here is that the setPixelShader<PixelShader>() method gets a member function pointer to the actual drawTriangleTemplate<PixelShader> function and stores it in a variable so that the drawTriangle call does not require the template any longer and will just forward the call to the stored member function.

Parallelization Using OpenMP

With OpenMP we can parallelize our software rasterizer in a simple way. We can use some #pragma statements to do this.

To use OpenMP with CMake we can use the following code in the CMakeLists.txt file:

find_package(OpenMP)
if (OPENMP_FOUND)
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
endif ()

In the rasterizer we can parallelize the for loop which iterates over all the blocks that need to be drawn. We could prepend the following pragma and be done:

#pragma omp parallel for collapse(2)

Unfortunately collapse only works on Linux since there we have OpenMP 3.0. Under Windows with Visual Studio we need a workaround. We need to restructure the loop so that we do not have any nesting.

int stepsX = (maxX - minX) / BlockSize + 1;
int stepsY = (maxY - minY) / BlockSize + 1;

#pragma omp parallel for

for (int i = 0; i < stepsX * stepsY; ++i)
{
  int sx = i % stepsX;
  int sy = i / stepsX;

  int x = minX + sx * BlockSize;
  int y = minY + sy * BlockSize;

  // Add 0.5 to sample at pixel centers.
  float xf = x + 0.5f;
  float yf = y + 0.5f;

  [...]

When we now run the rasterizer it is indeed faster but on my machine I only achieved a speedup of 2x.

Scanline Triangle Rasterization

After experimenting a bit I was not satisfied with the performance of the rasterizer and I tried some alternative approaches. Instead of doing the block based rasterization I followed a scanline based rasterization approach as desribed in Software Rasterization Algorithms for Filling Triangles.

This approach uses the fact that there are two special cases, a bottom flat and top flat triangle, which are easy to rasterize. It splits each triangle into two such easy triangles and rasterizes them individually.

template <class PixelShader>
void drawTriangleSpanTemplate(const Vertex& v0, const Vertex &v1, const Vertex &v2) const
{
  // Compute triangle equations.
  TriangleEquations eqn(v0, v1, v2, PixelShader::VarCount);

  // Check if triangle is backfacing.
  if (eqn.area2 <= 0)
    return;

  const Vertex *t = &v0;
  const Vertex *m = &v1;
  const Vertex *b = &v2;

  // Sort vertices from top to bottom.
  if (t->y > m->y) std::swap(t, m);
  if (m->y > b->y) std::swap(m, b);
  if (t->y > m->y) std::swap(t, m);

  float dy = (b->y - t->y);
  float iy = (m->y - t->y);

  // Trivial case top-flat triangle
  if (m->y == t->y)
  {
    const Vertex *l = m, *r = t;
    if (l->x > r->x) std::swap(l, r);
    drawTopFlatTriangle<PixelShader>(eqn, *l, *r, *b);
  }
  // Trivial case bottom-flat triangle
  else if (m->y == b->y)
  {
    const Vertex *l = m, *r = b;
    if (l->x > r->x) std::swap(l, r);
    drawBottomFlatTriangle<PixelShader>(eqn, *t, *l, *r);
  }
  else
  {
    // General case - split the triangle
    Vertex v4;
    v4.y = m->y;
    v4.x = t->x + ((b->x - t->x) / dy) * iy;
    if (PixelShader::InterpolateZ) v4.z = t->z + ((b->z - t->z) / dy) * iy;
    if (PixelShader::InterpolateW) v4.w = t->w + ((b->w - t->w) / dy) * iy;
    for (int i = 0; i < PixelShader::VarCount; ++i)
      v4.var[i] = t->var[i] + ((b->var[i] - t->var[i]) / dy) * iy;

    const Vertex *l = m, *r = &v4;
    if (l->x > r->x) std::swap(l, r);

    drawBottomFlatTriangle<PixelShader>(eqn, *t, *l, *r);
    drawTopFlatTriangle<PixelShader>(eqn, *l, *r, *b);
  }
}

The triangle filling is shown in the code block below. It can be parallelized with OpenMP which gives better performance.

template <class PixelShader>
void drawBottomFlatTriangle(const TriangleEquations &eqn, const Vertex& v0, const Vertex &v1, const Vertex &v2) const
{
  float invslope1 = (v1.x - v0.x) / (v1.y - v0.y);
  float invslope2 = (v2.x - v0.x) / (v2.y - v0.y);

  #pragma omp parallel for

  for (int scanlineY = int(v0.y + 0.5f); scanlineY < int(v1.y + 0.5f); scanlineY++)
  {
    float dy = (scanlineY - v0.y) + 0.5f;
    float curx1 = v0.x + invslope1 * dy + 0.5f;
    float curx2 = v0.x + invslope2 * dy + 0.5f;

    // Clip to scissor rect
    int xl = std::max(m_minX, (int)curx1);
    int xr = std::min(m_maxX, (int)curx2);

    PixelShader::drawSpan(eqn, xl, scanlineY, xr);
  }
}

template <class PixelShader>
void drawTopFlatTriangle(const TriangleEquations &eqn, const Vertex& v0, const Vertex &v1, const Vertex &v2) const
{
  float invslope1 = (v2.x - v0.x) / (v2.y - v0.y);
  float invslope2 = (v2.x - v1.x) / (v2.y - v1.y);

  #pragma omp parallel for

  for (int scanlineY = int(v2.y - 0.5f); scanlineY > int(v0.y - 0.5f); scanlineY--)
  {
    float dy = (scanlineY - v2.y) + 0.5f;
    float curx1 = v2.x + invslope1 * dy + 0.5f;
    float curx2 = v2.x + invslope2 * dy + 0.5f;

    // Clip to scissor rect
    int xl = std::max(m_minX, (int)curx1);
    int xr = std::min(m_maxX, (int)curx2);

    PixelShader::drawSpan(eqn, xl, scanlineY, xr);
  }
}

With the scanline based approach I get between 10-15% better performance when filling random triangles. Still that does not mean, that the scanline based approach is better in all the cases.

With the block based approach it would be relatively easy to support some form of a hierarchical z-buffer and cull whole blocks much faster if there is overdraw. This is not so simple with the scanline based approach.

One would have to compare the approaches in this specific situation.

Fixed Point Calculations

I tried out if replacing the floating point calculations in the innermost loops with fixed point calculations would give better performance. This way the pixel shader would not get floating point values for its attributes any more but fixed point values.

I implemented a class fixed<Bits> which I used to replace most floating point calculations that calculated the pixel data but kept the floating point calculations, which are done per triangle, so the rasterizer still takes floating point values for the vertex data as input.

I was able to achieve a considerable performance gain of 10-20% with that for filling random triangles with rgb color. The problem was that there were some color artifacts, so the PixelShader had to be modified to fix the artifacts. I needed to check for value overflow in the drawSpan method of the PixelShader and depending on the outcome of the test route the processing to a specialized drawPixel method. When I did the overflow test per pixel all the performance gain was lost.

Conclusion

The added pixel shader support makes the software renderer very modular and flexible and with the template based approach we get the best performance out of it.

Using OpenMP for parallelization we can improve the performance. But also using a scanline based approach and fixed point math makes quite a big difference.

In the next parts I will develop the vertex processing stage with a vertex cache and vertex shader support so that some more complex 3D models can be rendered.

Continue on Part 4.

Checkout the source at my github repo.