PINN vs FEM III - Numerical Comparison between Finite Elements and Single-Layer Perceptron
(Co-authored by Sean De Marco and Lloyd Fung)
In the previous post, we demonstrated the theoretical link between FEM and PINN. You might now wonder why, despite being mathematically equivalent, the two methods yield different results. This post addresses that question. By making the theoretical commonalities visual, we hope you will begin to see why we say that the SLP is adaptively meshed FEM in disguise.
How do they recreate solutions?
In the previous post we discussed how elements in FEM have compact support, whereas the SLP is not constrained in its optimisation, so each nodal activation spans the whole domain. These activations can be thought of as the basis of the SLP. Intuitively, you can already start to imagine how each method reconstructs the solution. Have a think about what that would look like, then have a look at the image below.
For FEM, each element is non-zero only in the region in which it acts. By summing these individual contributions over the whole domain in a way that fulfils the PDE, the solution can be reconstructed! Now turn your attention to the SLP reconstruction and its nodal activations. Each node activates across the whole domain but, just as in FEM, we can sum the contribution of each node, and the solution to the PDE is recovered by weighting each activation appropriately! This style of reconstruction applies even to higher-dimensional PDEs, but more emphasis then needs to be placed on the expressivity of the first-layer weights, which are responsible for deciding the activation direction. More on this in a later post.
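To make the two pictures concrete, here is a minimal NumPy sketch. The domain $[0, 1]$, the mesh size, the sigmoid slope of 10, the target $\sin(\pi x)$ and the helper names `hat_basis` and `sigmoid_basis` are all our own illustrative choices, not the setup behind the figure. It builds both bases, measures how much of the domain each basis function touches, and reconstructs a simple target as a weighted sum.

```python
import numpy as np

def hat_basis(x, nodes):
    """Piecewise-linear FEM hat functions, one column per mesh node."""
    h = nodes[1] - nodes[0]  # uniform mesh spacing (assumed)
    return np.clip(1.0 - np.abs(x[:, None] - nodes[None, :]) / h, 0.0, None)

def sigmoid_basis(x, w, b):
    """SLP hidden-layer activations sigma(w * x + b), one column per node."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] * w[None, :] + b[None, :])))

x = np.linspace(0.0, 1.0, 501)
nodes = np.linspace(0.0, 1.0, 11)     # 10 elements, 11 nodes
centres = nodes.copy()                # one activation centred on each node
w = np.full(centres.shape, 10.0)      # illustrative slope (assumption)
b = -w * centres                      # sigma crosses 1/2 at x = centre

phi_fem = hat_basis(x, nodes)
phi_slp = sigmoid_basis(x, w, b)

# Fraction of sample points where each basis function is non-zero.
fem_support = (phi_fem > 0).mean(axis=0)   # compact support: a small fraction
slp_support = (phi_slp > 0).mean(axis=0)   # global support: the whole domain

# FEM-style reconstruction: weight each hat by the nodal value and sum.
u = np.sin(np.pi * x)
u_fem = phi_fem @ np.sin(np.pi * nodes)

# SLP-style reconstruction: least-squares output weights on the global basis.
a, *_ = np.linalg.lstsq(phi_slp, u, rcond=None)
u_slp = phi_slp @ a
```

Both `u_fem` and `u_slp` are plain weighted sums of basis columns. The only structural difference is that each column of `phi_fem` is zero almost everywhere, while every column of `phi_slp` is non-zero at every sample point.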
From this simple case, you can already see how well the SLP might solve a PDE. The SLP’s global nature allows it to interpolate between points across the entire domain, and it could surpass FEM’s performance with the same number of optimised parameters. So why do we not replace all FEM solvers with PINN solvers and call it a day? Hoorah, AI has taken over! FEM is no more… Well, no. While globally supported activation functions might help the SLP use fewer nodes than FEM, they also come with costs. One is that the training cost scales worse than an FEM solve for more complex solutions, due to backpropagation through the whole network. Another is the violation of information structure. We briefly discussed both in the previous post. Here, we will focus on another Achilles’ heel of PINN: spectral bias.
The curse of being too expressive
It’s often argued that FEM is better than PINN because it doesn’t have the spectral bias that plagues PINN. That is actually not entirely true. FEM is also inherently biased in its spectrum: the mesh density determines the highest frequency FEM can resolve. FEM seems less spectrally biased only because that bias is explicit, easy to see, and actively removed by the user. By construction, FEM’s trial functions obey a certain mesh. This implies that if we increase the mesh density, the equivalent first-layer weight $\mathbf{W}^{(1)}$ in the SLP representation of FEM also increases. This is why, if the solution contains a higher-frequency signal, all we have to do in FEM is increase the mesh density until the solution is resolved at the frequency it needs. In other words, meshing to convergence is how the user removes FEM’s spectral bias.
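The effect of meshing to convergence can be sketched with a small 1D solver. This is a minimal illustration under our own assumptions, not the code behind the figures in this post: a Poisson problem $-u'' = f$ on $[0, 1]$ with homogeneous Dirichlet boundaries, a two-scale manufactured solution, linear elements on a uniform mesh, and a lumped quadrature for the load vector. The function names `exact`, `source`, `fem_poisson` and `max_error` are hypothetical.

```python
import numpy as np

def exact(x):
    # Two-scale manufactured solution (an illustrative choice).
    return np.sin(np.pi * x) + 0.5 * np.sin(10 * np.pi * x)

def source(x):
    # f = -u'' for the manufactured solution above.
    return (np.pi**2 * np.sin(np.pi * x)
            + 0.5 * (10 * np.pi)**2 * np.sin(10 * np.pi * x))

def fem_poisson(n_elements):
    """Linear-element FEM for -u'' = f on [0, 1] with u(0) = u(1) = 0."""
    nodes = np.linspace(0.0, 1.0, n_elements + 1)
    h = nodes[1] - nodes[0]
    n_int = n_elements - 1  # number of interior (unknown) nodes
    # Stiffness matrix of hat functions: tridiagonal (-1, 2, -1) / h.
    K = (2.0 * np.eye(n_int) - np.eye(n_int, k=1) - np.eye(n_int, k=-1)) / h
    # Load vector with lumped (trapezoidal) quadrature: F_i ~ h * f(x_i).
    F = h * source(nodes[1:-1])
    u = np.zeros(n_elements + 1)  # boundary values stay at zero
    u[1:-1] = np.linalg.solve(K, F)
    return nodes, u

def max_error(n_elements, x_fine=np.linspace(0.0, 1.0, 2001)):
    """Largest pointwise error of the piecewise-linear FEM solution."""
    nodes, u = fem_poisson(n_elements)
    return np.max(np.abs(np.interp(x_fine, nodes, u) - exact(x_fine)))

err_coarse = max_error(8)    # mesh too coarse to see the fast component
err_fine = max_error(200)    # mesh fine enough to resolve both scales
```

On the coarse mesh the fast component is aliased onto a slower mesh mode, so the error is of the order of the solution itself. Refining the mesh removes it, which is the user-driven removal of the bias described above.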
The SLP, in contrast to FEM, does not have this fixed correlation between the first-layer weights $\mathbf{W}^{(1)}$ and the number of nodes $H$ built in. Instead, $\mathbf{W}^{(1)}$ and $H$ are free to vary independently. This flexibility is a major reason for the SLP’s higher expressivity compared to FEM: it is free to choose the width of each activation (i.e. how rapidly the activation function rises from $0$ to $1$), the location of each activation, and how many activations exist (by varying the hidden dimension $H$). However, this expressivity is also the source of spectral bias. Because the first-layer weights $\mathbf{W}^{(1)}$ are no longer tied to a mesh, we rely either on a smart choice of their values when training an ELM or, when training NNs in the more traditional way, on the optimiser to tune the weights towards the right order of magnitude to resolve the spectrum of the solution. The problem is that the optimiser may not always drive $\mathbf{W}^{(1)}$ towards the right spectrum, or adjust the bias $\mathbf{b}^{(1)}$ so that the activation functions switch on in the correct locations. Sometimes it gets stuck in a low-frequency minimum, which matches the solution at low frequency but not at high frequency. Hence, spectral bias can be thought of as a failure of the optimiser to converge to the global minimum of the loss.
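For a 1D sigmoid activation, the roles of the weight and the bias can be written down explicitly. In the notation below, which is ours, $w_j$ and $b_j$ are the $j$-th entries of $\mathbf{W}^{(1)}$ and $\mathbf{b}^{(1)}$, $x_j^{c}$ is the point where the activation crosses $1/2$, and $\delta_j$ is the distance over which the tangent at that point rises from $0$ to $1$:

$$
\sigma\left(w_j x + b_j\right) = \frac{1}{1 + e^{-(w_j x + b_j)}}, \qquad
x_j^{c} = -\frac{b_j}{w_j}, \qquad
\left.\frac{\mathrm{d}\sigma}{\mathrm{d}x}\right|_{x_j^{c}} = \frac{w_j}{4}
\;\;\Longrightarrow\;\;
\delta_j \approx \frac{4}{|w_j|}.
$$

So $w_j$ alone sets the width and $b_j$, for a given $w_j$, sets the location. With $w_j = 1$ the transition spans about 4 units of $x$, while multiplying $w_j$ by 100 shrinks it to about 0.04. In FEM the analogous width is the element size, which is fixed by the mesh rather than left to an optimiser.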
An illustrative example through ELM
In the previous post, we introduced the Extreme Learning Machine (ELM). An ELM is an SLP that freezes the first-layer parameters and solves for the final layer analytically. It is a strong architecture and a convenient way to demonstrate the above points on spectral bias.
Let us examine this spectral bias with a test case featuring one large structure and one smaller structure, with frequencies of 0.5 Hz and 5 Hz respectively, crudely recreating a multi-scale task. To highlight how spectral bias occurs, we will continue to use an ELM with a fixed weight of $\mathbf{W}^{(1)}=\mathbf{1}$. For FEM, we assumed that a sufficiently fine mesh would capture both the large- and small-scale features and, as seen in the image below, it did. However, the ELM solution failed to resolve the high-frequency component. Of course, this is expected. Fixing $\mathbf{W}^{(1)}=\mathbf{1}$ means that no matter how big $H$ is ($H=1000$ in this case, which is heavily overprescribed for this problem), the activation functions can never resolve the higher-frequency signal in $x$, since the basis is far too wide to capture the quick variations of the small scales. This is what allows spectral bias to occur. FEM, in contrast, has the equivalent of $\mathbf{W}^{(1)}$ scaling with the mesh density (equivalent to $H$), which allows it to resolve higher frequencies with denser meshes.
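One way to see why a larger $H$ cannot rescue a basis that is too wide is to look at the numerical rank of the hidden-layer matrix. The sketch below is our own illustration, assuming a domain of $[0, 1]$, sigmoid activations with centres spread uniformly over the domain, and a hypothetical helper `sigmoid_features`. It counts how many numerically independent basis functions the hidden layer actually provides.

```python
import numpy as np

def sigmoid_features(x, w, centres):
    """Hidden-layer outputs sigma(w * (x - c)), i.e. bias b = -w * c."""
    return 1.0 / (1.0 + np.exp(-w * (x[:, None] - centres[None, :])))

x = np.linspace(0.0, 1.0, 1200)  # collocation points on an assumed domain [0, 1]

# Unit first-layer weight: grow H and watch the numerical rank.
ranks_unit = {}
for H in (10, 100, 1000):
    centres = np.linspace(0.0, 1.0, H)
    ranks_unit[H] = np.linalg.matrix_rank(sigmoid_features(x, 1.0, centres))

# Same H = 1000, but with the slope scaled up (illustrative factor of 100).
rank_scaled = np.linalg.matrix_rank(
    sigmoid_features(x, 100.0, np.linspace(0.0, 1.0, 1000))
)
```

With unit weights the columns are nearly copies of one another, so `ranks_unit` should stall at a small value however many nodes are added. Only the rescaled layer turns the extra nodes into extra resolvable detail.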
So, how do we fix it?
Given the deliberate example above, the fix should be pretty apparent, right? Just multiply $\mathbf{W}^{(1)}$ by some scalar to resolve the high frequency! By varying the first-layer weights $\mathbf{W}^{(1)}$, we are effectively stretching or contracting the activation functions in $x$, allowing them to represent different frequencies in $x$. Meanwhile, varying the first-layer biases $\mathbf{b}^{(1)}$ allows us to shift the activation functions to specific locations after rescaling.
Let us see this in action using the ELM. Previously, we fixed the first layer with constant weights $\mathbf{W}^{(1)}=\mathbf{1}$ and distributed the first-layer biases $\mathbf{b}^{(1)}$ uniformly across the whole domain. The node-output weights were then optimised in a linear least-squares fashion (i.e. through the ELM). As discussed above, the result wasn’t so great. Now, let us rescale $\mathbf{W}^{(1)}$ according to the mesh density by multiplying it with a constant scalar. By rescaling $\mathbf{W}^{(1)}$ to match the frequency needed to resolve the solution, and shifting $\mathbf{b}^{(1)}$ to keep the activations distributed across the domain as before, the SLP can now resolve the multi-scale solution much better, achieving a similar accuracy to FEM.
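A stripped-down version of this experiment can be sketched in a few lines. To keep it short we make several assumptions that differ from the post’s setup: the ELM fits the two-scale target directly (plain regression) rather than solving the Poisson problem, the domain is $[0, 1]$, $H = 200$, the rescaling factor is 50, and `elm_fit` is a hypothetical helper.

```python
import numpy as np

def elm_fit(x, y, w, centres):
    """Freeze the first layer, then solve the output weights by least squares."""
    # Bias b = -w * c keeps each activation centred at c after rescaling w.
    hidden = 1.0 / (1.0 + np.exp(-w * (x[:, None] - centres[None, :])))
    a, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    return hidden @ a

x = np.linspace(0.0, 1.0, 800)
# One large structure (0.5 Hz) plus one small structure (5 Hz).
y = np.sin(2 * np.pi * 0.5 * x) + 0.5 * np.sin(2 * np.pi * 5.0 * x)

H = 200
centres = np.linspace(0.0, 1.0, H)  # activations spread uniformly over the domain

# Unit first-layer weight versus a rescaled one (factor of 50 is illustrative).
rmse_unit = np.sqrt(np.mean((elm_fit(x, y, 1.0, centres) - y) ** 2))
rmse_scaled = np.sqrt(np.mean((elm_fit(x, y, 50.0, centres) - y) ** 2))
```

Only the scalar passed as `w` changes between the two fits. The number of nodes, their centres and the least-squares solve are identical, so any gap between `rmse_unit` and `rmse_scaled` comes purely from the activation width.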
So what happens when we train the SLP fully instead?
Now that we know we need to vary $\mathbf{W}^{(1)}$ according to the high-frequency content of the solution, you are probably asking: can’t the optimiser do this automatically?
Well, the answer is, sort of? It really comes down to the initial guess of the parameters and the specific optimiser you are using. We can think of this more mechanistically. The optimiser’s objective is to minimise the loss. Since the dominant structures (the low-frequency signal) in the solution contribute most to the loss, they tend to be minimised first, often at the expense of finer (higher-frequency) structures. This inherent bias towards solving low frequencies first can sometimes push the optimiser towards a local minimum, where it gets stuck instead of reaching the global minimum, giving rise to spectral bias in PINNs. The initialisation of the network can make this even more severe since, unlike in FEM, it is possible for the initialisation to place some activations outside of the domain, reducing their usefulness.
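One hedged way to make the “low frequencies first” argument concrete is to freeze the first layer, which is our simplifying assumption so that the model is linear in the output weights, and look at plain gradient descent on a least-squares loss. In our notation, $\boldsymbol{\Phi}$ is the matrix of hidden activations at the collocation points, $\mathbf{a}_k$ are the output weights after $k$ steps, $\mathbf{y}$ is the target, $\mathbf{r}_k = \boldsymbol{\Phi}\mathbf{a}_k - \mathbf{y}$ is the residual, $\eta$ is the learning rate, and $(\lambda_i, \mathbf{v}_i)$ are the eigenpairs of $\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$:

$$
\mathbf{a}_{k+1} = \mathbf{a}_k - \eta\, \boldsymbol{\Phi}^\top \mathbf{r}_k
\quad\Longrightarrow\quad
\mathbf{r}_{k+1} = \left(\mathbf{I} - \eta\, \boldsymbol{\Phi}\boldsymbol{\Phi}^\top\right)\mathbf{r}_k
\quad\Longrightarrow\quad
\mathbf{v}_i^\top \mathbf{r}_k = \left(1 - \eta \lambda_i\right)^k\, \mathbf{v}_i^\top \mathbf{r}_0 .
$$

Each component of the residual decays at its own rate, set by $\lambda_i$. For smooth, wide activations the large eigenvalues typically belong to slowly varying directions, so those parts of the error vanish quickly while the fine-scale parts linger. A fully trained PINN is nonlinear in $\mathbf{W}^{(1)}$ and $\mathbf{b}^{(1)}$ and its loss contains derivative terms, so this linear picture is only an analogy, not a description of the experiments in this post.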
Let us see this in practice. Say we have an SLP which is overprescribed, that is, it has more hidden dimensions and collocation points than necessary. If we train this SLP on a multi-scale task from a random initialisation, we might expect the PINN to struggle with the higher-frequency components of the solution. Across epochs, we would see the large structure being resolved first, followed by the smaller structures. Look at the gif below and see for yourself!
Evidently, the PINN does struggle with the higher frequencies. The gif shows the prediction evolving over the training epochs, the error at each point in the domain per epoch, and the Fourier transform of the error, which highlights the dominant frequencies in the error. We see spectral bias live in action! The solution plot shows the network quickly resolving the large structures and then attempting, and ultimately struggling, to resolve the high-frequency components. The error and Fourier transform plots show this extremely well. At the beginning of training there are 4 dominant traces. The optimiser resolves the frequencies in order, but the higher frequencies persist throughout training. This does not necessarily mean the network will never converge correctly (for such a simple problem it is arguably a matter of when, not if). It might just take a very long time, or we can go about it in a more intelligent manner. In fact, if you know a priori the rough order of magnitude of the spectrum and initialise the parameters accordingly, chances are you will converge correctly in a small number of steps. Even if you do not know the spectrum, a high enough $H$ and a wide enough spectrum at initialisation, with biases chosen so that all nodes are activated inside the domain, give the SLP a good chance of capturing the correct spectrum and therefore converging. This is why a proper initial guess is important when training PINNs.
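The error-spectrum diagnostic is easy to reproduce on any prediction. The sketch below is a minimal illustration under our own assumptions: a domain of $[0, 2)$, chosen so that both 0.5 Hz and 5 Hz fit a whole number of periods, and a synthetic stand-in for a mid-training prediction that has captured only the large scale. No network is trained here.

```python
import numpy as np

# Assumed domain [0, 2): both 0.5 Hz and 5 Hz complete whole periods on it.
n_points = 400
x = np.linspace(0.0, 2.0, n_points, endpoint=False)
dx = x[1] - x[0]

target = np.sin(2 * np.pi * 0.5 * x) + 0.5 * np.sin(2 * np.pi * 5.0 * x)
# Synthetic stand-in for a prediction: large scale captured, small scale missed.
prediction = np.sin(2 * np.pi * 0.5 * x)

error = prediction - target
freqs = np.fft.rfftfreq(n_points, d=dx)                    # frequency axis in Hz
amplitude = 2.0 * np.abs(np.fft.rfft(error)) / n_points    # single-sided amplitude

# Frequency that currently dominates the error.
dominant = freqs[np.argmax(amplitude)]
```

Evaluating `amplitude` once per epoch and stacking the results gives the kind of trace plot described above, where each peak is a frequency the optimiser has yet to resolve.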
This was the mentality we had in the ELM case, where performance depended strictly on the initialisation since the first layer is not iteratively changed in training. Keeping this idea in mind, let us retrain the model, taking care to initialise the biases such that the activation functions are activated inside the solution domain. By doing so, we help the optimiser focus on refining regions inside the domain, rather than risk it spending out-of-domain activations on refinement.
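A domain-aware initialisation of this kind can be sketched as follows. The details are our assumptions rather than the exact scheme used for the results in this post: a domain of $[0, 1]$, sigmoid-like activations centred at $-b/w$, and a naive baseline that draws weights and biases independently from a standard normal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10_000            # many nodes, only to make the statistics stable
lo, hi = 0.0, 1.0     # assumed solution domain

# Naive baseline (assumption): weights and biases drawn independently.
w_naive = rng.standard_normal(H)
b_naive = rng.standard_normal(H)
centres_naive = -b_naive / w_naive   # where sigma(w * x + b) crosses 1/2
frac_inside_naive = np.mean((centres_naive >= lo) & (centres_naive <= hi))

# Domain-aware: pick the centres inside the domain, then back out the biases.
w_aware = rng.standard_normal(H)
centres_target = rng.uniform(lo, hi, size=H)
b_aware = -w_aware * centres_target
centres_aware = -b_aware / w_aware
frac_inside_aware = np.mean((centres_aware >= lo) & (centres_aware <= hi))
```

For the naive baseline the ratio of two independent standard normals is Cauchy distributed, so only about a quarter of the activations are centred inside $[0, 1]$. Choosing the centres first and deriving the biases from them places every activation inside the domain without constraining the weights.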
In the same number of epochs as the previous randomly initialised SLP, the reinitialised SLP now converges to all frequencies in the solution! All by making sure one small aspect was included in our initialisation.
Through these tests we highlight two extremely important points on spectral bias. The first is that spectral bias is a temporal problem: it evolves over training time as the optimiser resolves the scales sequentially. The second is that spectral bias is heavily influenced by the model initialisation: with a smarter initialisation we “overcome” the spectral bias by making sure our parameters give the optimiser a solid foundation to build on. Tying this back to the ELM, where we intuitively rescaled the activations to fit the solution, we see that the optimiser, with the help of a good initialisation, does the same.
Interestingly enough, this does not necessarily imply that FEM will not suffer from the same problems. Because the mesh is inherently tied to the resolution of the domain, if the high frequencies are not resolved, the bias shows up clearly to the user as poor convergence, nudging the user to refine the mesh until the high frequencies are resolved. Even we show a spectral bias! Are mesh refinement studies just us humans manually overcoming our spectral bias?
On a more serious note, in higher dimensions, the thousands more parameters and the deeper NNs used tend to exacerbate the spectral bias problem. Exactly why high-dimensional problems or deeper NNs make the problem worse is still under investigation. However, it is not difficult to imagine that the loss landscape becomes more challenging to navigate in higher-dimensional problems, decreasing the chances of converging from a poorly resolved starting point. By illustrating these challenges, we begin to demystify the often opaque inner workings of PINNs.
So are PINNs meshless?
Yes and no. A PINN is “meshless” in the sense that the weights are no longer tied to a mesh, which earns it higher expressivity. This expressivity, along with overcoming the curse of dimensionality through the depth of the MLP (we will discuss that further in another post), is the main advantage PINN has over FEM. Yet it is not “meshless” in the sense that any arbitrary architecture can represent any arbitrarily complex solution to arbitrary precision. There is still a “mesh”, namely the NN architecture itself, that needs to be tuned according to the solution. This tuning will likely scale with the complexity of the solution it represents. Hence, in many ways, the NN architecture is the new “mesh”, and all the difficulties of generating meshes in numerical methods will likely also apply to the design of NN architectures. In particular, as in FEM, to converge to a correct solution we actually require some prior knowledge of the solution itself!
Summary
In this post, we numerically demonstrated the implications of what we called adaptively meshed FEM in disguise. In particular, we focused on how the globally supported nature of SLP activations is a double-edged sword, providing great flexibility at the cost of convergence guarantees and well-conditioning.
First, we showed that the first-layer weights and biases are responsible for controlling the width, direction and location of the activations. Through simple Poisson BVP tasks, we used the NN’s flexibility to demonstrate the simplest form of spectral bias in PINNs. When the ELM was not initialised for the right frequency, it struggled to converge to a correct solution, ultimately converging to a low-frequency solution and ignoring the high-frequency profile in the task. In that sense, the ELM demonstrated spectral bias. However, tuning the activation width (i.e. tuning $\mathbf{W}^{(1)}$) helped the ELM converge much better. In these tasks, the sigmoid activations were contracted so that they could represent the small-scale features in the problem. We essentially mimicked FEM by tying the mesh density (the hidden weights and biases) to the number of cells (the number of hidden dimensions). This is similar to adapting a mesh in FEM to account for areas of high variation.
Secondly, we wanted to check whether the optimiser would make these adjustments automatically when training an SLP. By setting up a multi-scale task and overprescribing the SLP, we observed spectral bias first hand, with the optimiser being highly sensitive to the initialisation of the parameters. In the first case, the SLP was initialised totally randomly and demonstrated heavy spectral bias: all the large scales were resolved but the smallest scale remained unresolved. In the second case, we borrowed the idea from the ELM and made sure that every activation was initialised to activate inside the solution domain. This worked extremely well and enabled the optimiser to quickly resolve all the scales in the task. These tests demonstrated the temporal nature of spectral bias, how the scales are resolved sequentially, and how intelligent initialisation helps to overcome spectral bias.
The ability to control the width, direction and location of activations calls into question whether PINNs are truly meshless. We choose to (very cautiously) crash the meshless party and say that SLPs are not as meshless as people would like them to be. They can arguably be interpreted as a globally supported, non-linear FEM with a neural network’s flexibility, where the mesh is hidden in the opacity of the model’s architecture and the convergence guarantees are hidden in the black box of the architecture-initialisation-optimiser relationship. This means that if you are using PINNs, you are still “meshing” implicitly when you design an architecture that works. And that means the same old challenges of mesh generation, adaptivity and prior knowledge still apply.
So, if PINNs are just adaptively meshed FEM in disguise, the question is not whether to use them, but rather how to use them better. How can we leverage their adaptivity without falling into the same old traps? Perhaps the answer lies in information theory. But enough for today. That is for a future blog post.