(Co-authored by Sean De Marco and Lloyd Fung)

In the previous post, we demonstrated the theoretical link between FEM and PINN. You might begin to question why, despite being mathematically equivalent, the two methods yield different results. This blog post addresses that question. By bringing the theoretical commonalities into a visual sense, you will begin to understand why we say that SLP is adaptively meshed FEM in disguise.

How do FEM and SLP approximate solutions to a PDE?

In the previous blog we discussed how trial functions in FEM have compact support, whereas the activation functions in SLP are globally supported, i.e. each nodal activation spans the whole domain. Here, we are going to call both the trial functions in FEM and the output of each hidden-layer node of the SLP (after the activation function) basis functions. Intuitively, you can already start to imagine how each method reconstructs the solution through a weighted summation of these basis functions.

To visualise these summations, let’s look at the plot of each basis function together with the reconstruction of FEM and SLP.

The basis recreation (dashed lines) of a polynomial function using FEM overlaid on the optimised solution

The nodal recreation (dashed lines) of a polynomial function using SLP overlaid on the optimised solution

So, we see that for FEM, each basis function is non-zero only in the region in which it acts. The approximation to the solution is reconstructed by summing all these individual contributions over the whole domain in a way that fulfils the PDE. In this way, FEM “solved” (or approximated the solution of) the PDE! Now turn your attention to the SLP reconstruction with its basis functions (i.e. the output of the hidden layer). We see that each node activates across the whole domain, i.e. it is globally supported. Just like in FEM, we can sum the contribution of each node with appropriate weighting to reconstruct the solution. This style of reconstruction also applies to higher dimensional PDEs, but there are some caveats in scaling. More on this in a later post.
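The two reconstructions described above can be mimicked in a few lines. The following is a minimal sketch and not the code behind the figures: it assumes piecewise-linear hat functions on a uniform mesh for FEM, sigmoid activations for the SLP, and an illustrative cubic target fitted by least squares rather than by solving a PDE. The helpers `hat_basis` and `sigmoid_basis`, the slope of 10 and the 11 nodes are hypothetical choices.

```python
import numpy as np

def hat_basis(x, nodes):
    """FEM hat functions on a uniform mesh, one column per node."""
    h = nodes[1] - nodes[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - nodes[None, :]) / h)

def sigmoid_basis(x, W, b):
    """SLP hidden-layer outputs sigmoid(W * x + b), one column per hidden node."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] * W[None, :] + b[None, :])))

x = np.linspace(0.0, 1.0, 201)
target = x**3 - x                      # illustrative polynomial (assumption)

nodes = np.linspace(0.0, 1.0, 11)
Phi_fem = hat_basis(x, nodes)
W = np.full(nodes.size, 10.0)          # assumed slope, identical for every hidden node
b = -W * nodes                         # puts each sigmoid's midpoint on a mesh node
Phi_slp = sigmoid_basis(x, W, b)

# Fraction of sample points on which each basis function is non-zero
fem_support = (Phi_fem > 1e-12).mean(axis=0)
slp_support = (Phi_slp > 1e-12).mean(axis=0)

# Both reconstructions are weighted sums of the basis columns
c_fem = nodes**3 - nodes               # hat-function weights are the nodal values
u_fem = Phi_fem @ c_fem
c_slp, *_ = np.linalg.lstsq(Phi_slp, target, rcond=None)
u_slp = Phi_slp @ c_slp

print("largest FEM support fraction:", fem_support.max())
print("smallest SLP support fraction:", slp_support.min())
print("max FEM error:", np.abs(u_fem - target).max())
print("max SLP error:", np.abs(u_slp - target).max())
```

Each hat function covers at most two cells of the mesh, while every sigmoid is non-zero at every sample point, which is the compact versus global support contrast in the text.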

From this simple case, you can already see how well the SLP would be able to solve a PDE. The SLP’s global nature allows for interpolation between points across the entire domain and could surpass FEM’s performance with the same number of optimised parameters. So why do we not replace all FEM solvers with PINN solvers and call it a day? Hoorah, AI has taken over! FEM is no more… Well, no. While the global support of the activation functions in an NN might help the SLP use fewer nodes than FEM, it also comes with some costs. One is that the training cost scales worse than for FEM. Even in the case of ELM-SLP (freezing the first layer and optimising the last layer with regression), FEM is much faster due to the sparsity arising from the local support of the trial functions, while the global support of the SLP’s activation functions means the matrix to be inverted during regression is dense. The other, related issue is the violation of information structure, as global support implies every node is affected by every point in space. We have briefly discussed both issues in the previous post. Here, we will discuss YET another issue of the NN approach - the Achilles’ heel of PINN - spectral bias.
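To make the sparsity argument concrete, the sketch below compares how full the normal-equation matrix $\Phi^\top\Phi$ is for hat functions and for sigmoids. It is an illustration under assumptions (uniform nodes, a sigmoid slope of 20, a least-squares fit instead of a full FEM assembly or PDE solve), and `fill_fraction` is a hypothetical helper.

```python
import numpy as np

n = 21                                   # number of basis functions (assumption)
centres = np.linspace(0.0, 1.0, n)
h = centres[1] - centres[0]
x = np.linspace(0.0, 1.0, 401)           # sample / collocation points

# Locally supported hat functions versus globally supported sigmoids
hats = np.maximum(0.0, 1.0 - np.abs(x[:, None] - centres[None, :]) / h)
sigmoids = 1.0 / (1.0 + np.exp(-20.0 * (x[:, None] - centres[None, :])))

def fill_fraction(Phi, tol=1e-12):
    """Fraction of entries of Phi^T Phi that are numerically non-zero."""
    gram = Phi.T @ Phi
    return np.mean(np.abs(gram) > tol)

fem_fill = fill_fraction(hats)           # tridiagonal: only neighbouring hats overlap
slp_fill = fill_fraction(sigmoids)       # dense: every sigmoid overlaps every other

print("FEM fill:", fem_fill, "expected tridiagonal fill:", (3 * n - 2) / n**2)
print("SLP fill:", slp_fill)
```

The hat-function matrix has $3n-2$ non-zero entries, so its fill shrinks as the mesh is refined, whereas the sigmoid matrix stays completely full regardless of $n$.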

The curse of being too expressive

It’s often argued that FEM is better than PINN because it doesn’t exhibit the spectral bias that plagues PINN. That is actually not entirely true. FEM is also inherently biased in its spectrum - the mesh density determines the highest frequency FEM can resolve. FEM seems less spectrally biased only because that bias is explicit, easy to see, and actively removed by the user. By construction, FEM’s trial functions obey a certain mesh. This implies that if we increase the mesh density, the equivalent first-layer weight $\mathbf{W}^{(1)}$ in the SLP representation of FEM also increases. This is why, if the solution contains a higher frequency signal, all we have to do in FEM is increase the mesh density until the solution is resolved at the frequency it needs. In other words, the act of meshing to convergence by the user is the way FEM removes the spectral bias.
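The link between mesh density and the first-layer weight can be checked numerically. The sketch below uses one standard construction, a hat function written as a sum of three ReLU units, which may differ from the representation used in the previous post. The function names and the choice of 10 and 100 cells are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat_from_relu(x, centre, h):
    """Hat function of half-width h built from three ReLU units with weight 1/h."""
    w = 1.0 / h                          # first-layer weight is tied to the cell size
    return (relu(w * (x - centre + h))
            - 2.0 * relu(w * (x - centre))
            + relu(w * (x - centre - h)))

def hat_direct(x, centre, h):
    """Textbook definition of the same piecewise-linear hat function."""
    return np.maximum(0.0, 1.0 - np.abs(x - centre) / h)

x = np.linspace(0.0, 1.0, 1001)
weights = {}
for n_cells in (10, 100):
    h = 1.0 / n_cells
    assert np.allclose(hat_from_relu(x, 0.5, h), hat_direct(x, 0.5, h))
    weights[n_cells] = 1.0 / h
    print(f"{n_cells} cells -> equivalent first-layer weight {weights[n_cells]:.0f}")
```

In this construction a mesh ten times finer needs a first-layer weight ten times larger, which is the coupling between mesh density and $\mathbf{W}^{(1)}$ described above.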

SLP, in contrast to FEM, does not have this fixed correlation between the first-layer parameters $[\mathbf{W}^{(1)},\mathbf{b}^{(1)}]$ and the number of nodes $H$ built in. Instead, $[\mathbf{W}^{(1)},\mathbf{b}^{(1)}]$ and $H$ are free to vary independently. This flexibility is a major reason for SLP’s higher expressivity compared to FEM - it has more freedom to choose the width of the activation, i.e. how rapidly the activation function rises from $0$ to $1$ (defined by $\mathbf{W}^{(1)}$), the location of the activation (defined by $\mathbf{b}^{(1)}$), and the number of basis functions, by varying the hidden dimension (defined by $H$). However, it is also this expressivity that is the source of spectral bias. The fact that the first-layer weights $\mathbf{W}^{(1)}$ are no longer tied to a mesh implies that we rely either on a smart choice of their values when training an ELM or, in the more traditional way of training NNs, on the optimiser to tune the weights towards the right order of magnitude to resolve the spectrum of the solution. The problem is that the optimiser may not always optimise $\mathbf{W}^{(1)}$ towards the right spectrum, or adjust the biases $\mathbf{b}^{(1)}$ such that the activation functions are activated in the correct locations. Sometimes it may get stuck in a low frequency minimum, which matches the solution at low frequency but not at high frequency. Hence, spectral bias can be thought of as a failure of the optimiser to converge to the global minimum of the loss.
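For a sigmoid activation, the roles of $\mathbf{W}^{(1)}$ and $\mathbf{b}^{(1)}$ can be written down explicitly. The sketch below assumes a single scalar input and a sigmoid activation. `activation_centre` and `rise_width` are hypothetical helper names, and measuring the width between the 10% and 90% levels is just one convenient convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def activation_centre(W, b):
    """Location x at which sigmoid(W * x + b) equals 0.5."""
    return -b / W

def rise_width(W):
    """Distance in x over which sigmoid(W * x + b) climbs from 0.1 to 0.9."""
    return 2.0 * np.log(9.0) / np.abs(W)

for W in (1.0, 10.0, 100.0):
    b = -0.5 * W                         # keeps the activation centred at x = 0.5
    c = activation_centre(W, b)
    w = rise_width(W)
    low = sigmoid(W * (c - w / 2) + b)   # value at the start of the rise
    high = sigmoid(W * (c + w / 2) + b)  # value at the end of the rise
    print(f"W={W:6.1f}  centre={c:.2f}  width={w:.4f}  levels=({low:.2f}, {high:.2f})")
```

The width shrinks in proportion to $1/|\mathbf{W}^{(1)}|$ while the centre sits at $-\mathbf{b}^{(1)}/\mathbf{W}^{(1)}$, so the two parameters control scale and location independently of $H$.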

An illustrative example through ELM

In the previous post, we introduced the Extreme Learning Machine (ELM). An ELM is an SLP in which the first-layer parameters are frozen and the final layer is solved analytically. Because its behaviour depends entirely on how the first layer is set, it is well suited to demonstrating the above points on spectral bias.

Let us examine this spectral bias with a test case featuring one large structure and one smaller structure, with frequencies of 0.5 and 5 Hz respectively, crudely recreating a multi-scale task. To highlight how spectral bias occurs, we will continue to use ELM with a fixed weight of $\mathbf{W}^{(1)}=\mathbf{1}$. For FEM, we assumed that a sufficiently fine mesh would capture both large and small scale features and, as seen in the image below, it did. However, the ELM solution failed to resolve the high frequency component. Of course, this is expected - the fact that we fixed $\mathbf{W}^{(1)}=\mathbf{1}$ means that no matter how big $H$ is ($H=1000$ in this case, which is heavily overprescribed for this problem), the activation functions can never resolve the higher frequency signal in $x$, since the basis is far too wide to capture the quick variations of the small scales. This is what enables spectral bias to occur. FEM, in contrast, has the equivalent of $\mathbf{W}^{(1)}$ scaling with the mesh density (equivalent to $H$), which allows it to resolve higher frequencies with denser meshes.

Multi-scale task: the ELM shows spectral bias, which was overcome by rescaling the initialisation

So, how do we fix it?

Given the deliberate example above, the fix should be pretty apparent, right? Just multiply $\mathbf{W}^{(1)}$ by a suitably large scalar to resolve the high frequency! By varying the first-layer weights $\mathbf{W}^{(1)}$, we are effectively stretching or contracting the activation functions in $x$, allowing them to represent different frequencies in $x$. Meanwhile, varying the first-layer biases $\mathbf{b}^{(1)}$ allows us to shift the activation functions to specific locations after rescaling.

Let us see this in action using ELM. Previously, we fixed the first layer with constant weights $\mathbf{W}^{(1)}=\mathbf{1}$ and uniformly distributed the first-layer biases $\mathbf{b}^{(1)}$ across the whole domain. The node-output weights were then optimised in a linear least squares fashion (i.e. through ELM). As discussed above, the result wasn’t so great. Now, let us rescale $\mathbf{W}^{(1)}$ according to the mesh density by multiplying it with some constant scalar. By rescaling $\mathbf{W}^{(1)}$ to match the frequency needed to resolve the solution, and shifting $\mathbf{b}^{(1)}$ to keep the activations distributed across the domain as before, the SLP can now resolve the multi-scale solution much better, achieving a similar accuracy to FEM.
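The effect of the rescaling can be reproduced with a small ELM. The sketch below makes several assumptions that depart from the experiment in the text: it fits the target function directly by least squares rather than solving the Poisson problem, it uses $H=200$ instead of $H=1000$, the amplitudes of the 0.5 Hz and 5 Hz components are made up, and the rescaling factor is taken as the inverse spacing of the activation centres. All function names are hypothetical.

```python
import numpy as np

def hidden_layer(x, W, b):
    """Frozen sigmoid hidden layer, one column per hidden node."""
    return 1.0 / (1.0 + np.exp(-(np.outer(x, W) + b)))

def target(x):
    # 0.5 Hz and 5 Hz components as in the text; the amplitudes are assumptions
    return np.sin(2 * np.pi * 0.5 * x) + 0.5 * np.sin(2 * np.pi * 5.0 * x)

H = 200                                   # hidden nodes (assumption, smaller than the post)
centres = np.linspace(0.0, 1.0, H)        # activations spread uniformly over the domain
x_train = np.linspace(0.0, 1.0, 400)
x_test = np.linspace(0.0, 1.0, 1001)

def elm_rms_error(scale):
    """Train only the output weights by least squares and return the test RMS error."""
    W = np.full(H, scale)                 # same first-layer weight for every node
    b = -W * centres                      # shift biases so the centres stay in place
    c, *_ = np.linalg.lstsq(hidden_layer(x_train, W, b), target(x_train), rcond=None)
    prediction = hidden_layer(x_test, W, b) @ c
    return np.sqrt(np.mean((prediction - target(x_test)) ** 2))

err_wide = elm_rms_error(1.0)             # W = 1: activations far wider than the fine scale
err_scaled = elm_rms_error(float(H - 1))  # W = 1 / centre spacing: tied to the "mesh"

print("RMS error with W = 1:      ", err_wide)
print("RMS error with rescaled W: ", err_scaled)
```

Only the first-layer scale changes between the two runs. The number of nodes, the centres and the least-squares solve are identical, which isolates the activation width as the cause of the difference.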

So what happens when we train the SLP fully instead?

Now that we know we need to vary $\mathbf{W}^{(1)}$ according to the high frequency content of the solution, you are probably asking: can’t the optimiser do this automatically?

Well, the answer is, sort of? It really comes down to the initial guess of the parameters and the specific optimiser you are using. We can think of this more mechanistically. The optimiser’s objective is to minimise the loss. Since the dominant structures (the low frequency signal) in the solution contribute most to the loss, they tend to be minimised first, often at the expense of finer (higher frequency) structures. This inherent bias towards solving low frequencies first can sometimes push the optimiser into a local minimum, where it gets stuck instead of reaching the global minimum, giving rise to spectral bias in PINNs. The problem can be made even more severe by the initialisation of the network since, unlike in FEM, it is possible for the initialisation to place some activations outside of the domain, reducing their efficiency.

Let us see this in practice. Say we have an SLP which is overprescribed, that is, it has more hidden dimensions and collocation points than necessary. If we train this SLP to fulfil a Poisson equation with multi-scale forcing (same problem as the previous post) with randomly initialised parameters using Adam as the optimiser, we might expect that this PINN will struggle to resolve the higher frequency components in the solution. Across epochs, we see the large structure being resolved, followed by the smaller structures. Look at the GIF below and see for yourself!

A GIF showing SLP training on a multi-scale task, with plots of the solution over time, the error over the domain and a Fourier transform of the error. Here the SLP is initialised completely randomly

Evidently, the PINN does struggle with the higher frequencies. The GIF shows the PINN solution evolving over the training epochs, the error at each point in the domain per epoch, and the Fourier transform of the error to highlight the dominant frequencies in the error. We see spectral bias live in action! The solution plot shows the network quickly resolving the large structures and then attempting to resolve the high frequency components, ultimately struggling. The residual plot (middle), which shows the residual of the equation, and the spectrum of the residual (right) show this extremely well. Soon after the training starts, we see the optimiser resolve the lower frequencies well. However, the higher frequency residual persists throughout the training. This does not necessarily mean it will never converge correctly (for such a simple problem, you could say it is a matter not of if but of when); it just might take a very long time.
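The spectral diagnostic in the right-hand panel can be sketched with a real FFT. The example below is an assumption about how such a plot might be produced, not the code used for the GIF. The error signal is synthetic, with a small 1 Hz remainder and a larger unresolved 5 Hz component whose amplitudes are made up, and `error_spectrum` is a hypothetical helper.

```python
import numpy as np

def error_spectrum(error, dx):
    """One-sided amplitude spectrum of a real signal sampled with spacing dx."""
    n = error.size
    freqs = np.fft.rfftfreq(n, d=dx)
    # Factor 2/n recovers sine amplitudes for all bins except DC and Nyquist
    amps = 2.0 * np.abs(np.fft.rfft(error)) / n
    return freqs, amps

n = 512
x = np.arange(n) / n                      # uniform samples on [0, 1)
# Synthetic error: low frequency nearly resolved, 5 Hz component still present
error = 0.01 * np.sin(2 * np.pi * 1.0 * x) + 0.2 * np.sin(2 * np.pi * 5.0 * x)

freqs, amps = error_spectrum(error, dx=1.0 / n)
dominant = freqs[np.argmax(amps[1:]) + 1]  # skip the DC bin

print("dominant error frequency:", dominant)
print("amplitude at that frequency:", amps[np.argmax(amps[1:]) + 1])
```

Tracking `dominant` and `amps` over the training epochs would show which frequencies the optimiser has resolved and which ones persist.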

However, if you know a priori the rough order of magnitude of the spectrum and initialise the parameters accordingly, chances are you will converge better and faster. Even if you do not know the spectrum, a high enough $H$, a wide enough spectrum in the initial weights, and biases initialised so that all nodes are activated within and around the domain of $x$ give the SLP a good chance of capturing the correct spectrum and therefore of converging. This is why a proper initial guess is important when training PINNs.

This was the mentality we had in the ELM case, where the ELM performance was strictly dependent on the initialisation since the first layer is not iteratively changed in training. Now let us keep this idea in mind and retrain the model, taking care to initialise the biases such that the activation functions are activated inside the solution domain (in this case $x \in [0,1]$), instead of using the typical default of $0$-centred distributions. By doing so, we help the optimiser focus on refining regions inside the domain, rather than risking it relying on out-of-domain activations for refinement.

In the same number of epochs as the previous randomly initialised SLP, the reinitialised SLP now managed to converge to all frequencies in the solution! All by just making sure one small aspect was included in our initialisation.
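The domain-aware bias initialisation described above can be sketched as follows. This is an illustration under assumptions and not the initialisation code used for the experiment: the "default" is taken to be a zero-centred uniform distribution on $[-1, 1]$ for both weights and biases, the activation centre is taken as $-b/W$ (which holds for a sigmoid-like activation of $Wx+b$), and `fraction_inside` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 1000                                   # hidden nodes (assumption)
domain = (0.0, 1.0)                        # solution domain x in [0, 1]

def fraction_inside(W, b, lo=domain[0], hi=domain[1]):
    """Fraction of hidden nodes whose activation centre -b/W lies in the domain."""
    centres = -b / W
    return np.mean((centres >= lo) & (centres <= hi))

# Zero-centred default: weights and biases drawn independently
W = rng.uniform(-1.0, 1.0, H)
b_default = rng.uniform(-1.0, 1.0, H)
frac_default = fraction_inside(W, b_default)

# Domain-aware: pick the centres inside the domain first, then solve for the biases
centres = rng.uniform(domain[0], domain[1], H)
b_aware = -W * centres
frac_aware = fraction_inside(W, b_aware)

print("default init, centres inside domain:     ", frac_default)
print("domain-aware init, centres inside domain:", frac_aware)
```

With the zero-centred default only about a quarter of the activation centres land inside $[0,1]$, whereas the domain-aware choice places all of them there by construction.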

Through these tests we highlight two extremely important points on spectral bias. The first is that spectral bias is a problem of the optimisation process: it evolves over training time as the optimiser resolves the scales sequentially. The second is that spectral bias is heavily influenced by the model initialisation. We demonstrated that with smarter initialisation we “overcome” the spectral bias by making sure our parameters give a solid foundation for the optimiser to build on. Tying this back to what we did in the ELM, where we intuitively rescaled the activations to fit the solution, we see that the optimiser, with the help of a good initialisation, does the same.

Interestingly enough, this does not necessarily imply that FEM will not suffer from the same problems. Because the mesh is inherently tied to the resolution of the domain, if the high frequencies are not resolved, the bias shows up clearly to the user as poor convergence, nudging the user to refine the mesh until the high frequencies are resolved. Therefore, can we argue that FEM’s mesh refinement process is equivalent to architecture, hyperparameter and initialisation tuning in training neural networks, and that by tuning them right, both methods can ultimately remove the spectral bias exhibited initially due to bad meshing/initialisation/architecture?

As a side note, in higher dimensions, the thousands more parameters and deeper NNs seem to exacerbate the spectral bias problem. The exact reason why high dimensional problems or deeper NNs make the problem worse is still under investigation. However, it is not difficult to imagine that the loss landscape tends to be more challenging to navigate in higher dimensional problems, decreasing the chances of convergence from a poorly resolved starting point. By illustrating these challenges, we begin to demystify the often opaque inner workings of PINNs.

So are PINNs meshless?

Yes and No. It’s “meshless” in the sense that the weights are no longer tied to a mesh, which earns PINN higher expressivity. This expressivity, along with the overcoming of the curse of dimensionality through the depth of MLP (we will discuss that further in another post), is the main advantage PINN has over FEM. Yet, it’s not entirely “meshless”. The architecture (number of hidden nodes, number of layers, etc.) still needs to adapt to the complexity of the solution and the precision required. There is still a “mesh”, namely the NN architecture itself, that needs to be tuned according to the solution. This tuning will likely scale with the complexity of the solution it represents. Hence, in many ways, the NN architecture is the new “mesh”, and all the difficulties of generating meshes in numerical methods will also likely apply to the design of NN architectures. In particular, like FEM, to converge to a correct solution, we actually require some prior knowledge of the solution itself!

Summary

In this post, we numerically demonstrated the implications of what we called adaptively meshed FEM in disguise. In particular, we focused on how the globally supported nature of SLP activations is a double-edged sword, providing great flexibility at the cost of convergence guarantees and well-conditioning.

First, we showed that the first-layer weights and biases are responsible for controlling the width and location of the activations. Through the simple Poisson boundary value problem, we demonstrated the simplest form of spectral bias in PINNs by manually tuning the first layer and training only the last layer, i.e. training the SLP like an extreme learning machine (ELM). When the ELM was not initialised for the right frequency, it struggled to converge to the correct solution, ultimately converging to a low frequency solution and ignoring the high frequency profile in the task. There, we can say that the ELM demonstrated spectral bias. However, tuning the activation width (i.e. tuning $\mathbf{W}^{(1)}$) helped the ELM converge much better. In the task, the sigmoid activations were compressed in $x$ by increasing the magnitude of $\mathbf{W}^{(1)}$ such that they were able to represent the small scale features in the problem. Essentially, we mimicked FEM by tying the mesh density (the hidden weights and biases) to the number of cells (the number of hidden dimensions). This would be similar to adapting a mesh in FEM to account for areas of high variation.

Secondly, we wanted to check whether the optimiser in training an SLP would optimise the first-layer parameters $[\mathbf{W}^{(1)},\mathbf{b}^{(1)}]$ automatically. By setting up a multi-scale task and overprescribing the SLP, we observed the spectral bias first hand, with the optimiser being highly sensitive to the initialisation of the parameters. In the first case, the SLP was initialised totally randomly and demonstrated heavy spectral bias: the large scales were resolved but the smallest scale remained unresolved. In the second case, we borrowed the idea from the ELM and made sure the initialisation placed each activation inside the solution domain. This worked extremely well and enabled the optimiser to quickly resolve all the scales in the task. These tests demonstrated the temporal nature of spectral bias, in which the scales are resolved sequentially, and showed that intelligent initialisation helps overcome it.

The ability to control the width and location of activations puts into question whether PINNs are truly meshless. We choose to (very cautiously) crash the meshless party and say that SLPs are not as meshless as people would like them to be. SLP can arguably be interpreted as adaptively meshed FEM, where the mesh is hidden in the opacity of the model’s architecture and the convergence guarantees are hidden in the black box of the architecture-initialisation-optimiser relationship. This means that if you are using PINNs, you are still “meshing” implicitly when you design an architecture that works. And that means the same old challenges of mesh generation, adaptivity and prior knowledge still apply.

So, if PINNs are just adaptively meshed FEM in disguise, the question is not whether to use them, but rather how to use them better. How can we leverage their adaptivity without falling into the same old traps? Perhaps the answer lies in information theory. But enough for today, that is for a future blog post.