Physically Based 3D Rendering and Ray Marching

Rendering computer graphics as long been one of my interests, either at work (stuff related to photogrammetry) or as a hobby (thanks to the excellent POV-Ray rendering engine in particular). Following my previous articles about colorimetry and color spaces (1, 2), I felt like diving into that subject one more time.

The pinhole camera model, matrix transformations of meshes, sRGB colors are what I've been taught at university and used in the past to implement a few crappy homemade renderers. This is all very well explained in that series of articles for example. However, aroused by all my recent reading about color theory and videos about ray marching on The Art Of Code, I decided to keep studying these subjects and try to implement a renderer based on them.

After a few months of work, it ended up as a half-failure. I could reach a renderer able to produce relatively nice looking images, but still had many issues to solve. They would require way more time than I could/wanted to invest, so I gave up on that project. In this article I want to gather the new things I've learnt, what I've tried and the results I've been able to reach, as a reference when I'll challenge myself once again on that subject.

Physically based rendering is a fuzzy concept simply saying "we'll try to simulate the real world as best as we can". That's what every renderer tried to do since the dawn of time, but as modern computers has become powerful enough to render images almost indistinguishable from photographies (1, 2, 3) the expression is getting more commonly used. Even though, there are still many shortcuts and approximations which commercials take good care to hide and avoid as usual. If you're looking for the state of the art of PBR implementation, the reference book is here. I haven't read it, what's below is my very modest and limited understanding on the subject.

Rendering process

At the highest level the rendering process consists of calculating the sRGB value of each pixels in an image from the description of a scene and an observer. The scene is a collection of objects whose shape I decide to define in compatibility with the ray marching algorithm, and whose material definition describes how it interacts with light. The observer is a transformation operation converting a portion of the flow of light bathing the scene into its representation as an image. The observer may represent a camera or biological eye. I'll use the term 'camera' indifferently in what follow.

As it is a very slow process I make it multithreaded: several parallel threads compute in parallel the image pixels. Displaying the result image in real time as it is rendered helps checking everything works fine and avoid having to wait for the whole image to be rendered and realise it's completely wrong. Also, computing pixels in a non sequential order (left to right, top to bottom) helps a lot getting a good idea of the final image at early stage of the rendering. If the 2D matrix of pixels is an array where pixels are stored by row, and the image is rectangular (which is most often the case), I've found that a simple hack of the calculation of the pixel index from its row and column indexes works very well. If the image is square it doesn't break, just render in sequential order, so it's ok.

The images below contains only 10% and 20% of rendered pixels, yet one would already be able to notice potential issues like wrong exposure, framing, composition, lighting, etc...

The sRGB value of a pixel is calculated as the average over the observer's photoreceptive cells associated to that pixel.

Of course, the more cells per pixel, the higher the quality of the image and the longer the rendering time.

Spectral power distribution

The result pixel color is encoded as sRGB, cause that's what everybody's using, but all calculations are made using spectral power distribution to describe the flow of light. This doesn't have to be so, it's a personal choice following my previous study of colorimetry. The intent is to stay as near as possible to the "reality".

Using 380-700nm as the range for visible light and 5nm as discretisation step (1), a SPD can be represented with an array of \((700-380)/5=64\) float values. Multiplying by the number of pixels we get the amount of memory needed to memorise the whole flow of light making up the image. Using my Nikon D5100 as a reference camera (sensor of W23.6mm x H15.6mm and W4928px x H3264px according to Wikipedia) and 4 bytes encoding for floating point values, it becomes \(64*4*4928*3264=3.8Gb\). Using Full HD (1920px x 1080px) instead, it becomes \(64*4*1920*1080=530Mb\). These numbers must be multiplied by the number of cells per pixel if one want to memorise the whole sensor information at once. If a modern desktop PC should be able to hold in memory the discretised SPD of all cells for Full HD, good quality image for higher resolutions may not be possible. However one doesn't need all cells at once, only the ones of the currently computed pixels are sufficient to compute the final image.

There is also another possible representation of a SPD, a mixture of piecewise Gaussian (MoPG): \(SPD(\lambda)=\sum_i(G(\lambda,b_i,a_i,\mu_i,\sigma_i^l,\sigma_i^r))\), where $$ G(\lambda,b,a,\mu,\sigma_l,\sigma_r)=\left\lbrace{ \begin{array}{l} b+a.exp(-(\lambda-\mu)^2/(2\sigma_l^2)),\forall \lambda\le\mu\\ b+a.exp(-(\lambda-\mu)^2/(2\sigma_r^2)),\forall \lambda\gt\mu\\ \end{array}}\right. $$ This representation avoid discretisation and is more compact if the number of Gaussian is less than \(\lfloor 64/5\rfloor =12\). For SPD with more Gaussians, a possible strategy is to calculate another mixture with less Gaussians and approximating at best the original mixture.

The SPD is affected by materials in two ways: emission and absorption. The interaction varies with the wavelength (for example a material can absorb more of a wavelength range and less of another) and the material properties (for example a material emission varies with its temperature). The transformation applied to the SPD of a flow of light passing through a point \(\vec{P}\) occupied by a material \(M\) can be described as \(\mathcal{M}(SPD, M,\vec{ P},\lambda)=SPD(\lambda).\mathcal{R}(M,\vec{P},\lambda)+\mathcal{E}(M,\vec{P},\lambda)\) where \(\mathcal{R}(M,\vec{P},\lambda)\in[0,1]\) is the reflectance (the opposite of the absorption) of the material \(M\) at point \(\vec{P}\) for wavelength \(\lambda\) and \(\mathcal{E}(M,\vec{P},\lambda)\in[0,\infty[\) is the emission of the material \(M\) at point \(\vec{P}\) for wavelength \(\lambda\). I've written more about material's reflectance in this article and made an app to obtain and edit the reflectance of real world material. The CIE standard illuminants give some reference emission spectrum. Other emission spectrum seems difficult to obtain.

Both \(\mathcal{R}\) and \(\mathcal{E}\) can be approximated with MoPGs. The result SPD reaching a cell is then the integration of these MoPGs along the path of the flow of light reaching that cell. This requires to know these MoPGs at any point. Approximating a material as homogeneous, i.e. \(\mathcal{R}\) and \(\mathcal{E}\) constants with respect to \(\vec{P}\) simplifies greatly calculation here, and that certainly explain why I can't find any example online of really PBR non homogeneous material (only cheating homogeneous volume covered by a non homogeneous texture).

Edit on 2023/08/06: In this article, Ive provides macros and explanations to perform rendering from spectral power distribution using POV-Ray for those interested.

Camera

The principle of a basic camera is not so complicate, but modern camera made up of complex assembly of lenses and high performance sensors are much more complex, not speaking of their biological equivalent. Taking one step further from the pinhole model toward realism without trying to bite more than I can chew, I divide the camera model into two components: a sensor and a lens+diaphragm apparatus.

Camera sensor

The sensor is a grid of photoreceptive cells. The radiant flux reaching each cell is described by a SPD, and the cell response (in CIE XYZ) is calculated by integrating that SPD with the color matching functions of the sensor (cf my previous article). The CMFs of the Nikon D5100 CCD have been studied here. For the human eye, or by default, I use the CIE standard observer. I've covered in previous articles (1, 2) how to perform the conversion from CIE XYZ to sRGB.

The dimensions of the sensor need to be known in meters (real dimensions) and pixels (cell grid dimensions). For cameras, the specifications are generally available online. If we want to simulate the human eye, we could approximate the sensor size with the average eye dimension: W24.2mm x H23.7mm, and the grid size from the number of cones (6 millions): W2475px x H2424px.

Finally, there is one last point to specify about the sensor: the sensitivity. I tried to figure out what's behind the ISO number, got completely lost and gave up. What I do instead is to calibrate the implementation using the sunny 16 rule: calculate the sRGB output of standard illuminant D65 for a f-number equals to 16, an exposure time of 1/100s, and scale the response to normalise it to 1.0.

I also initially thought of simulating the film grain of analog cameras. Refering to this paper it can be done by altering the SPD of each sensor cell \(c\) as follow: \(spd'_c(\lambda)=spd_c(\lambda)*(W_{6.5e-6}(\lambda)+W_{15e-6}(\lambda))/2\) where \(W_d(\lambda)=W_c^*e^{-(\frac{2}{9\pi})^{\frac{1}{3}}(\pi\lambda d)^2}\) and \(W_c^*=1\pm G.rand_c([0,1])\). \(G\) measures how strong the noise is. I gave up before implementing this part.

Camera lens and diaphragm

The radiant flux reaching a cell is made of the flow of light entering the camera through the diaphragm aperture and redirected toward the cell by the lens. It's another continuous function which needs to be discretised. It is done by sampling positions in the aperture, calculating the SPD of light transiting along rays traveling from these sampling positions to the cell, and approximating the radiant flux with the sum of these SPD multiplied by the exposure time. I call these rays the 'inner' rays.

I choose to distribute the sampling points inside the aperture according to the Fibonacci polar lattice (randomly rotated per cell), and to calculate the number of sampling position as \(n=max(1,\lfloor 0.5\pi r^2\rfloor)\) where \(r=l/(2f)\) is the radius of the aperture, and \(l\) is the focal length (in millimeter) and \(f\) is the aperture (f-number). The Fibonacci lattice is often said to give the best results for that purpose. The choice of the number of sample is arbitrary and questionable.

Sampling at a discrete number of location has the inconvenient of discretizing the amount of light reaching the sensors, while it should be continuous in function of the aperture. To correct for that the radiant flux \(r_s\) is corrected as follow: \(r_s=r_a/n*\pi r^2\) where \(r_a\) is the sum of the sample's radiant flux and \(r\) if the radius of the aperture (\(0.5*d_f/f\)).

The aperture measures how much the diaphragm is opened. The larger the aperture value, the less light reaches the sensor (yes, that's counter-intuitive), the larger the depth of field (sharp whatever the distance), the nearer to a pinhole camera model. Its value (also called f-number) is equal to the focal length divided by the diameter of the diaphragm opening.

The focal length of the lens (in millimeter, by default equal to eye nominal focal length: 17mm) controls the field of view. The shorter the focal length the larger the field of view, the longer the smaller (increasing the focal length is like zooming). It also relates the focus distance \(d_f\) with the distance from the lens to an object in focus \(d_o\): \(1/f=1/d_o+1/d_f\). The focus distance to put an object at a given distance in focus can then be calculated as follow: \(d_f=(1/f-1/d_o)^{-1}\). The same way, the distance at which an object is in focus for a given focus distance can be calculated as follow: \(d_o=(1/f-1/d_f)^{-1}\). Checking for the eye (\(f=17mm\)), if the object is \(d_o=10cm\) away from the lens the focus point is \(d_f=20.5mm\) away from the lens, if \(d_o=1m\) then \(d_f=17.3mm\), if \(d_o=100m\) then \(d_f=17.0mm\). For a 35mm camera lens (\(f=35mm\)), if \(d_o=10cm\) then \(d_f=53.8mm\), if \(d_o=1m\) then \(d_f=36.3mm\), if \(d_o=100m\) then \(d_f=35mm\).

Each inner ray's SPD is equal to its corresponding outer ray's SPD (it means the lens doesn't affect the incoming light, which isn't true). The outer ray reaches the lens at the position \(A\) the inner ray originates from. The direction of the outer ray \(\overrightarrow{OP}\) coming from the inner ray \(\overrightarrow{SO}\) passing through the center of the lens \(O\) is the same as the direction of that inner ray. Lets call that outer ray the central outer ray. The direction of other outer rays \(\overrightarrow{AP}\) are such as they all intersect the central outer ray on the focus plane (the plane containing the points in focus).

From the specifications and explanations above we know \(S\), \(O\), \(A\), \(d_f\) and \(d_o\). \(\overrightarrow{AP}\) can be calculated as follow. $$ \overrightarrow{SP}=\overrightarrow{SA}+\overrightarrow{AP} $$ then $$ \overrightarrow{AP}=\overrightarrow{SP}-\overrightarrow{SA} $$ \(\overrightarrow{SA}\) is simply \((0,0,d_f)^T\). \(\overrightarrow{SP}\) is calculated as follow. $$ \overrightarrow{SP}=\frac{d_f+d_o}{d_f}\overrightarrow{SO} $$ or $$ \overrightarrow{SP}=\frac{d_f+d_o}{d_f}(-S_x,-S_y,d_f)^T $$ Then, $$ \overrightarrow{AP}=\frac{d_f+d_o}{d_f}(-S_x,-S_y,d_f)^T-(A_x-S_x,A_y-S_y,d_f)^T $$ or $$ \overrightarrow{AP}=(A_x-(1+\frac{d_f+d_o}{d_f})S_x,A_y-(1+\frac{d_f+d_o}{d_f})S_y,d_o)^T $$

This gives a bundle of initial outer rays to be solved to get the radiant flux reaching a cell.

Flow of light

I've explained earlier how materials affect the SPD of the flow of light. They also affect the flow of light by bending and scattering its direction.

The bending is calculated from the refractive index of materials. The refractive index of a material is a function of the wavelength, approximated by the Cauchy equation. This equation as an infinite number of parameters, but for practical purpose only the first two are used: \(n(\lambda)=A+\frac{B}{\lambda^2}\).

The value of these parameters for some common materials are available on the Wikipedia page (link above), or on refractiveindex.info. If the diffraction of light (cf below) is not relevant, one can also use the values available here for \(A\) and set \(B\) to \(0.0\). Or, one can use two parameters and only one wavelength (conventionnally 589.29nm). For an unknown material, or artistic purpose, one can also estimate \(A\) from the reflectivity \(r\) (cf below) as follow: \(A=\frac{1.000293+\sqrt{r}}{1.000293-\sqrt{r}}\) and set \(B\) to \(0.0\).

Edit on 2024/05/12: Correction of the formula for estimation of A from reflectivity.

As for \(\mathcal{R}\) and \(\mathcal{E}\), strictly speaking \(n\) varies also with \(\vec{P}\) but for practical purpose it is considered to be constant (homogeneous material). There is one notable case where this approximation is problematic: mirage which are induced by refractive index varying with air temperature which isn't homogeneous. One could approximate this phenomenon by splitting the volume of air into several layers or blobs, each with a different refractive index.

The homogeneous material hypothesis implies that bending only occurs at the interface of two different materials. In what follows I'll use \(M_o\) and \(M_i\) to describe respectively the material in which the flow of light is traveling (outside material) and the material the flow of light is traveling into (inside material). \(n_o\) and \(n_i\) are their respective refractive index. The interaction point \(\vec{I}\) is the interaction of the flow of light and the interface between \(M_o\) and \(M_i\).

Reflectivity

The reflectivity \(r\in[0,1]\) is the amount of light bouncing back from \(M_i\) into \(M_o\) (commonly called reflection). Given that \(\theta_o\) is the angle between the incoming ray and the normal (oriented toward \(M_o\)) of the interface at the interaction point, it is calculated using the Fresnel equation: \(r(\lambda)=\left(\frac{n_o(\lambda)-n_i(\lambda)}{n_o(\lambda)+n_i(\lambda)}\right)^2\) if \(\frac{n_o(\lambda)}{n_i(\lambda)}sin(\theta_o)\le1.0\), \(1.0\) else (in which case it's called total internal reflection). If \(SPD_r\) is the SPD of the flow of light bouncing back and \(SPD\) is the SPD of the incoming light: \(SPD_r(\lambda)=r(\lambda).SPD(\lambda)\).

Note that in case of metallic materials the reflected light is also affected by the absorption component of the inside material. The SPD of the flow of light bouncing back then becomes: \(SPD_r(\lambda)=r(\lambda).SPD(\lambda).\mathcal{R}(M_i,\vec{I},\lambda)\).

The direction of the flow of light \(\vec{v}\) bouncing back is simply the symetric of the direction of the incoming flow \(\vec{u}\) relatively to the normal \(\vec{n}\) (oriented toward \(M_o\) and normalised) of the interface at the interaction point: \(\vec{v}=\vec{u}+2\vec{n}\).

As objects rarely have a smooth, well polished surface, calculating the reflection only based on the normal at the interaction point gives poor results. For better result I use a roughness coefficient \(r\in[0,+\infty[\) used as follow. If \(r>\epsilon\), the reflected flow is divided into \(\lfloor r\rfloor\) sub flows and the direction of the i-th sub flow \(\vec{s_i'}\) is defined as \(Rot_{\vec{v},\theta_i}(\vec{v}+c_ir/R(\vec{u}*\vec{n}))\), where \(Rot_{\vec{a},b}(\vec{c})\) is the rotation of \(\vec{c}\) around \(\vec{a}\) by \(b\) radians (left-hand coordinate system), \(\theta_i\in[0,2\pi]\) is a random angle, \(R=100\) is a constant empirically choosen, and \(c_i\in[0,1]\) is a random value. Finally, sub flows directed toward the inside material after applying roughness are discarded.

Transmissivity

The transmissivity is simply the complementary of the reflectivity: \(t(\lambda)=1-r(\lambda)\). The SPD of the transmitted flow of light is then calculated the same way as the one of the reflected flow of light. It is the portion of the flow of light passing from \(M_o\) into \(M_i\).

If \(t\ne 0.0\) the direction of the transmitted flow of light \(\vec{v}\) is calculated as follow. If \(\vec{u}\) is the direction of the incoming flow of light, \(\vec{n}\) the normal (oriented toward \(M_o\) and normalised) of the interface at the interaction point, \(\phi=sin^{-1}(\frac{n_o(\lambda)}{n_i(\lambda)}sin(\theta_o))-\theta_o\), \(Rot_{\vec{a},b}(\vec{c})\) is the rotation of \(\vec{c}\) around \(\vec{a}\) by \(b\) radians (left-hand coordinate system), then \(\vec{v}=Rot_{\vec{u}*\vec{n},\phi}(\vec{u})\). When \(\phi\ne 0.0\) we say that the light is refracted ("broken" pencil in a cup of water). When the second parameter of the refractive index \(B\ne 0.0\), the wavelengths of the incoming flow of light get spreaded over a range of angles, it's called diffraction (rainbow coming out of a prism).

Scattering

While moving inside a material the flow of light interacts with it by bouncing into its molecules. This is called scattering. It occurs for all materials even those that we would called totally opaque, the difference is just that for those opaque materials the light penetrates only until an infinitely small depth before bouncing back out of the material. The opposite are completely transparent material, where the light can flow through the material almost without any interaction in a straight line.

Strictly speaking, to simulate scattering one would have to simulate every photons and know the probablity it bounces on a molecule (atom?). Then, for each photon one would calculate the distance until the next collision, randomly redirect the photon, and repeat until the photon eventually get out of the material (just to start again with the outside material...). This is of course totally impractical.

The solution I came up with is as follow. I defined an 'opacity' value \(o\in]0,+\infty[\) for a material, the smaller the more opaque, the larger the more transparent. Then, calculate the scattering distance \(d_s=min(o, d)/2\) where \(d\) is the distance from the interaction point where the flow entered the material to the interaction point where the flow would exit the material if there was no scattering. Define the scattering point as \(\vec{S}=\vec{I}+d_s\vec{u}\) where \(\vec{u}\) is the direction of the flow of light in the material. Then, split the flow of light into \(n_s\) sub-flows of light with direction spreaded in all direction using a Fibonacci lattice (randomly rotated). The higher \(n_s\), the more realistic the rendering is, the more time it takes to render. Then, for each subflow, calculate the next interaction point \(\vec{J}\) (the location where it's getting out of the material). Then, calculate \(c=1/max(1,||\overrightarrow{JS}||/o)\). If \(c<\epsilon\) discard the subflow, else calculate the subflow from \(\vec{J}\) in direction \(\overrightarrow{JS}\) and multiply its SPD by \(c\) and \(\mathcal{R}(M,\vec{J},\lambda)\).

About \(n_s\), I choose a value based on the 'depth' of the ray. I explained how the flow of light is splitted into subflows when interacting with materials, and how the SPD of a cell is calculated from initial outer rays. These initial outer rays are at depth 0, the rays associated to the sub flows of light forming these initial outer rays are at depth 1, and so on. Then, for a depth \(e\), I set empirically \(n_s=\frac{50}{5^e}\). This also gives an ending condition (\(n_s=0\)) to the infinite subdivision into sub flows.

Primitive shapes

I've described how to simulate the interaction between flows of light and materials. It requires to locate the interaction points, the intersection between a flow of light and the interface between two materials, in other word between a ray and a geometrical shape. I describe now a few primitive shapes in the context of ray marching. In what follows the rays and normals are supposed to be expressed in the shape local coordinates system. Points along the ray are defined as \(\vec{P}=\vec{O}+t\vec{D}\), \(t\ge0\). By canonical I mean "before applying any transformation".

Plane

A canonical plane has its origin at the coordinates system origin and lies in the x-z plane. The implicit function of the plane is \(\vec{P}.\vec{y}=0\). The intersection of the plane and the ray can be calculated by solving \((\vec{O}+t\vec{D}).\vec{y}=0\) which is simply \(t=-o_y/d_y\), for \(t\ge0\).

The normal of the plane at the intersection point is always equal to \(\vec{y}\).

Sphere

A canonical sphere has its origin at the coordinates system origin and a radius equal to 1. The implicit function of the sphere is \(||\vec{P}||=1\). The interesection of the sphere and the ray can be calculated by solving \(||\vec{O}+t\vec{D}||=1\), equivalent to \(||\vec{O}+t\vec{D}||^2=1\) to get rid of the square root. This expands into the quadratic equation \(at^2+bt+c=0\) with \(a=(d_x^2+d_y^2+d_z^2)\), \(b=2(o_xd_x+o_yd_y+o_zd_z)\) and \(c=(o_x^2+o_y^2+o_z^2-1)\), which can be easily solved for \(t\ge0\). If there are two solutions, the smallest \(t\) is used (we wants the nearest interaction to the ray origin).

The normal of the sphere at the intersection point is always equal to the intersection point.

Box

A canonical box is centered on the coordinates system origin and its faces lies on the xy, xz, yz planes one unit away from the origin. The implicit function of the box is \(max(|x|,|y|,|z|)=1\). The intersection of the box and the ray can be calculated by solving \(max(|o_x+td_x|,|o_y+td_y|,|o_z+td_z|)=1\). It can be solved as follow. For a given axis \(a\), the solutions of \(|o_a+td_a|=1\) are \(t_0=(1-o_a)/d_a\) and \(t_1=(-1-o_a)/d_a\). We are interested in the nearest intersection to the ray origin, then choose the smallest positive value \(t'_a\) in \(\{t_0,t_1\}\). If \(|o_b+t'_ad_b|\le1\) for the two other axis \(b\), the ray intersects the box at \(\vec{o}+t'_a\vec{d}\). But it can also intersects it on another face, then keep the smallest \(t'_a\) for \(a\in\{x,y,z\}\). For quick detection of intersection, we can stop as soon as one valid intersection has been found.

The normal of the box at the intersection point is equal to the normal of the face containing the intersection point.

Cone

A canonical cone has the center of its base at the coordinates system origin, its rotation axis is along the y axis and its height is one unit. It has two parameters: the radius at the base \(r_0\), and the radius at the top \(r_1\). The implicit function of the cone is \(max(|2y-1|,\frac{x^2+z^2}{(r_0+(r_1-r_0)y)^2})=1\). The intersection of the cone and the ray can be calculated by solving \(max(|(2o_y-1)+2td_y|,\frac{(o_x^2+o_z^2)+2t(o_xd_x+o_zd_z)+t^2(d_x^2+d_z^2)}{((r_0+(r_1-r_0)o_y)+t((r_1-r_0)d_y))^2})=1\). The same way as for the box it can be solved as follow. The solution of \(|(2o_y-1)+2td_y|=1\) is calculated, then \(\frac{(o_x^2+o_z^2)+2t(o_xd_x+o_zd_z)+t^2(d_x^2+d_z^2)}{((r_0+(r_1-r_0)o_y)+t((r_1-r_0)d_y))^2}\le1\) is checked for the smallest positive solution. And reciprocally for \(\frac{(o_x^2+o_z^2)+2t(o_xd_x+o_zd_z)+t^2(d_x^2+d_z^2)}{((r_0+(r_1-r_0)o_y)+t((r_1-r_0)d_y))^2}=1\) and \(|(2o_y-1)+2td_y|\le1\). If both have a solution, the smallest positive one is used. For a quick test intersection, I'll use the bounding box of the cone.

The normal of the cone at the intersection point is equal to \([x,(r_0-r_1)\sqrt{x^2+z^2},z]\) (before normalisation).

Torus

A canonical torus has its center at the coordinates system origin and lies in the xz plane. It has two parameters: its major radius \(r_0\) and its minor radius \(r_1\). The implicit function of the torus is \((\sqrt{x^2+z^2}-r_0)^2+y^2-r_1^2=0\). The square root can be removed as follow: $$ \begin{array}{l} (\sqrt{x^2+z^2}-r_0)^2+y^2-r_1^2=0\\ x^2+z^2-2r_0\sqrt{x^2+z^2}+r_0^2+y^2-r_1^2=0\\ (x^2+y^2+z^2)+(r_0^2-r_1^2)=2r_0\sqrt{x^2+z^2}\\ ((x^2+y^2+z^2)+(r_0^2-r_1^2))^2=(2r_0\sqrt{x^2+z^2})^2\\ (x^2+y^2+z^2)^2+2(x^2+y^2+z^2)(r_0^2-r_1^2)+(r_0^2-r_1^2)^2=4r_0^2(x^2+z^2)\\ (x^2+y^2+z^2)^2-2(r_0^2+r_1^2)(x^2+z^2)+2(r_0^2-r_1^2)y^2+(r_0^2-r_1^2)^2=0\\ \end{array} $$ The intersection of the torus and the ray can be calculated by solving the quartic polynomial \(At^4+Bt^3+Ct^2+Dt+E=0\) where $$ \begin{array}{l} A=c^2\\ B=2bc\\ C=2ac+b^2+f-i-l\\ D=2ab+e-h-k\\ E=a^2+r^2+d-g-j\\ \end{array} $$ and $$ \begin{array}{l} r=(r_0^2-r_1^2)\\ s=(r_0^2+r_1^2)\\ a=o_x^2+o_y^2+o_z^2\\ b=2(o_xd_x+o_yd_y+o_zd_z)\\ c=d_x^2+d_y^2+d_z^2\\ d=2ro_y^2\\ e=4o_yd_yr\\ f=2rd_y^2\\ g=2so_x^2\\ h=4so_xd_x\\ i=2sd_x^2\\ j=2so_z^2\\ k=4so_zd_z\\ l=2sd_z^2\\ \end{array} $$

An implementation for solving cubic and quartic polynomial can be found here, I'll reuse and adapt it for my needs. If there are several solutions, we keep the smallest positive \(t\). For a quick test intersection I'll use the bounding box of the torus.

The normal of the torus at the intersection point is equal to \([x-\frac{x}{\sqrt{x^2+z^2}},y,z-\frac{z}{\sqrt{x^2+z^2}}]\) (before normalisation).

Constructive solid geometry

From these primitive shapes it is possible to create more complex ones by using constructive solid geometry (CSG). There are three operations in CSG: union, intersection, difference. Each one is very simple to implement in the ray marching framework. If \(d_0(\vec{x})\) and \(d_1(\vec{x})\) are the distance function of two shapes, then \(min(d_0,d_1)\), \(max(d_0,d_1)\), \(max(d_0,-d_1)\) are the distance functions of respectively the union, intersection and difference of these two shapes. It is even super cheap to have smooth CSG which, in combination to perturbation of distance functions, I think is the most seducing feature of ray marching. The reference in term of modeling using ray marching is this page by Inigo Quilez.

Ray marching, in practice

Simple, clean, powerful, ray marching really rocks ... until it becomes a big mess. All the sources related to ray marching I've ever found use it in a static context: the shapes' distance function and the whole scene composition are hard coded functions made of functions introduced above conveniently assembled and refactored in a big plate of spaghetti. I'm aiming for a dynamic system where the input is a script written by the user where shapes are arbitrary combinations of several hierarchic levels of CSG of primitives or user defined shapes with on geometric transformation and perturbation at any level. Material available on the web fall short at explaining how to achieve that. So I'll do it myself below.

The CSG operations work on a pair of distances. This implies that these distances are expressed in the same coordinate system (or at least two coordinate systems where \(||\overrightarrow{A_0B_0}||=||\overrightarrow{A_1B_1}||,\forall A_0,B_0,A_1=M.A_0,B_1=M.B_0\) where \(M\) is the projection matrix from one system to the other). The distance functions are expressed in the local, untransformed, coordinate system of the shape they represent. So, to be used in CSG each distance must be corrected for the eventual transformations applied to each object. However, this is dependant of the ray direction. The figure below illustrates why. If we use the generic sphere distance function to calculate the distance of the next step of a ray originating from O toward \(\vec{y}\) on a scaled sphere along \(\vec{y}\), we get the distance \(OA\), which overshoots the surface of the sphere if not corrected for the scaling before stepping along the ray, and to apply this correction we need a vector while the distance is a scalar.

So, the distance function of a CSG function requires the direction as well as the position. As treating CSGs as just another primitive shapes helps a lot to simplify the implementation of ray marching, it means the interface of a distance function takes a position and a direction too. One must also take care of infinite loop due to numerical imprecision while ray marching. And the distance function must stay calculable for a null vector to allow its usage as an inside/outside test function. Altogether the algorithm for ray marching is as follow, where shape.transformations, shape.thickness, shape.roundness, shape.smooth, shape.comp are respectively the geometrical transformations applied to the canonical shape, the thickness of the shell effet ('onion' effect on Quilez website), the strength of the rounding effect, the strength of the smoothing effect on transition between components of a CSG, and the two components of a CSG:

Suddenly ray marching doesn't look so simple to implement any more, isn't it ? Well, it's not that bad and once it's done one can enjoy all its power without thinking of all the mechanic under the hood. Hopefuly the cleanup I did above will save headaches next time I want to use ray marching.

Edit on 2023/07/28: I've made a standalone and ready to use module implementing SDF, CSG and ray marching out of the work introduced here. It's available in this article.

Physically based rendering, in practice

With a definition of geometrical shapes, a way to calculate the interaction points between these shapes and the flows of light, a definition of the interaction between fow of light and materials, and between flows of light and camera, I have everything I need to implement the rendering engine. It's time for more practical problems to kick in.

In principle PBR isn't terribly complicate: follow the flow of light, bend and split it as needed when it encounters the interface between two materials, and update its SPD all along. The only 'little' problem is the number of rays needed to simulate the flow of light. Starting from the camera, except the case of an aperture infinitely small (pinhole model) corresponding to one single ray per cell entering the lens, that's already an infinite number of rays spreaded over the area of the aperture. Later, each time the light is scattered, reflected or diffracted, it adds another inifinite number of sub rays. To make it computable one must approximate these infinite numbers of rays with only a few ones, and how to do it efficiently is terribly difficult.

As long as a ray doesn't encounter a light emitting material, it won't carry any light. Once it does, it's SPD is fed by the emitting material, and it propagates to parent rays down to the camera cell affected by all the interaction with materials along the way. Starting from the camera and going backward the flow of light, it may takes an infinite number of interaction with materials before an emitting one is reached. In practice it is necessary to define a depth (number of interaction) after which the algorithm gives up following the flow of light in hope for emitting material. When this happen, there are two options: set the SPD of the ray to a default one (ambient light), or consider there is no light at all coming from that direction.

In an outdoor scene, many rays will reach the sky. If it is modeled as a dome of light (sphere encompassing the whole scene and emitting light), their SPD will be fed and the two options above for the remaining rays are sufficient to obtain nice results. In an indoor scene illuminated by lamps, very few rays will reach them. The scene looks then completely dark unless the number of rays is insanely large.

In the past the solution was to reduce at each interaction the number of subrays to the number of light sources (often reduced to a point), and to force their direction toward these light sources. If there is no object between the interaction point and the light source the subray will hit the light source immediately, job done. If not, back to the two options above. The problem is that the quality of the image suffers very much from that simplification. In reality the light bouncing off all objects in the scene affects all the other objects. Put a red box next to a white wall and the wall will suddenly look reddish, even if the box is not emitting any light per se. I choose a mixed solution: subrays in random directions and subrays forced toward objects emitting light.

In term of quality of the result image there is a tug of war between random directions and uniformly distributed directions. If completely random the image looks extremely grainy, no matter how many subrays are used. If uniformly distributed, the grain disappears but artefacts appears due to the uniform discretisation. I found that a randomly rotated Fibonacci lattice (per interaction) gives the best results. The nice distribution of the Fibonacci lattice reduces the grain, and the random rotation avoid the artefacts. The remaining grain gives a touch to the image that reminds me to the grain of analog photography, which I like, and its amount can be controlled with the number of cell per pixels (the more cell, the less grain). For the number of subrays in random direction, I found that an adaptative solution gives the best solution. This number decreases exponentially with the depth of the subrays.

About subrays forced toward light sources, one subray per light source produces sharp unrealistic shadows. Instead, I use 7: one toward the center \(\vec{c}\) of the object emitting light (a priori always the origin of the object's local coordinate system), and 6 toward locations around it \(\vec{p_i}=\vec{c}+d\{+\vec{x},+\vec{y},+\vec{z},-\vec{x},-\vec{y},-\vec{z}\}_i\), where \(d\) is an estimation of the size of the object (I'm using a constant, but an automatic estimation using the signed distance function is possible and preferable). \(\vec{p_i}\) gives the uniform distribution component, and for the random component I actually direct the subrays toward \(\vec{p'_i}=r_i\vec{c}+(1-r_i)\vec{p_i}\) where \(r\in[0,1]\) is a random value. Contrary to the subrays in random direction, I like to keep the number of subrays directed toward light sources constant over depth.

Pseudocode

Ok, everything put together gives the following algorithm:

Edit on 2024/06/09: Correction of the calculation of the direction of the reflected ray in the pseudo code above.

Scene description language

From the user point of view I want a scripting language similar to what POV-Ray does, but I'm too lazy to do my own one. The motivation behind the scripting language is simply to hide the C programming language to allow a non programmer to use the renderer. I found that hiding as many thing as possible behind C macros and introducing these macros as the scene description language is a particularly interesting solution.

An empty script looks like this:

The renderer is a single header file containing all the code. The scene script is actually a C program but the user doesn't need to know that. He includes the renderer engine by copying the #include and uses the SCENE macros which will do all the dirty work. This macro creates a main function rendering the scene defined by the macros argument. The user needs to compile the script asa C rogram, but that's a single line command that can be hidden behind an alias (alias render="gcc -o /tmp/renderer $1 -lm -O3 -pthread -lpng && /tmp/renderer"), needs to be done once and just forget about it. The macro looks like this:

PreRender() setups all that's needed to parse the macro's argument, then this argument itself defines the scene, and finally Render() actually render the image. The renderer instance is a global variable in the header and macros provide shortcuts to its properties. For example, setting the rendering parameters looks like this:

The scene is built using another macro, ADD_OBJECT, which takes a shape definition and a material definition. These definitions are provided as yet other macros for predefined primitives and materials. For example a scene with a golden sphere looks like this:

ADD_OBJECT clones the shape and material in argument and add the copy to the renderer list of objects. SPHERE allocates an instance of the object implementing a sphere in the renderer.

GOLD does the same for a material with gold properties. Optionally the properties of these instance can be set thanks to designated initializers. Missing properties are initialised to zero or whatever the function hidden behind the macro sets them to. For example, the previous scene with a scaled and translated sphere and user-defined material becomes:

Here SCALE and TRANSLATE are other macros, yet to keep hidden the C code to a non-programmer user. They look like this:

Last example, a CSG object is created like this:

I do not show here everything cause this article is already way to long, but with predefined shapes, materials, transformations and properties, a user already has a lot of liberty to create scenes. Of course, as it is 100% C code, a programmer could extend the scripting language with functions procedurally generating shapes and materials, or even entire scenes. I've used this scripting method to create all the images in this article. I think it's easy to read and write for normal users, and easy to extend with procedural generation by programmers.

Conclusion

In the end, I couldn't bring my implementation to a complete and stable version and from this point of view it is a failure. Still, it was able to produce the images in this article, which I feel already satisfying, especially the simulation of scattering. More importantly I've learnt a lot of new things. On the theory of light, the fact that reflection, refraction, diffraction all results from a single value (the index of refraction) astonished me a lot. The implication of the index of refraction varying with the wavelength is also stunning. What a jumble is the flows of light in a scene! The balance grain/artefact was also fun to play with. The scripting language hack using macros really pleased me and will certainly be reused somewhere else. I'm glad to have cleaned up all the logic about ray marching combined with CSG and transformations. I really couldn't find anything similar on the web so I guess/hope it will serve someone. At least it will serve me cause I definitely want to come back to this technic very soon, maybe to try to implement a real-time renderer and see what are the problems in that different context. And I hope I'll be able to revisit PBR one day too of course.