Learning about CORS through a particularly nasty bug
It all started on a bright Wednesday morning. I began with a nice cup of coffee and logged into my computer to check whether our app was still doing fine. I expected it to be running smoothly, but we had received a bug report stating that our application didn't open at all. Damn... Good thing I already had my coffee at hand.
So why wouldn't it run? I checked whether the problem existed on my local machine. It didn't. The bug report came with some information about the user, so I checked their roles and credentials, and they were fine. I looked through some parts of the code, but couldn't find anything. My teammates and I quickly ran out of ideas, so we decided to add more logging to the application, hoping that the error would surface in the logs. The next day, we asked the users to use the app again even though it was still broken, and checked the logs afterwards. They didn't help. The bug did not show up in them.
The strange thing was that only some of the machines displayed the bug. Other machines were fine. This is where the headaches really started...
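Let's first revisit what CORS is. CORS is short for Cross-Origin Resource Sharing. It is a mechanism that lets a server tell the browser which other origins are allowed to read its responses, which relaxes the browser's same-origin policy in a controlled way. A CORS error is an error that is raised by the browser when a site/app on one origin (e.g. https://first.com) makes a request to a server on a different origin (e.g. https://second.com) and the server's response does not permit that origin.
Basically, the browser tells the server which origin the request comes from (in the Origin header) and then checks whether the response contains an Access-Control-Allow-Origin header that allows that origin. If it doesn't, the browser blocks the response and reports an error. For requests that are not considered "simple" (for example, requests with custom headers or methods like PUT), the browser first asks for permission in a pre-flight OPTIONS request and only sends the actual request afterwards.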
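To make that check a bit more concrete, here is a minimal sketch of the decision the browser makes for a plain request. This is my own simplified model, not browser code: the function name corsAllowsRead is made up, and it ignores credentials, pre-flight requests and everything else the real algorithm covers.

```typescript
// Simplified model of the browser's CORS check (hypothetical helper,
// not a real browser API). Credentials and pre-flight are ignored.
function corsAllowsRead(
  pageOrigin: string,               // origin of the page making the request
  targetOrigin: string,             // origin of the server being called
  allowOriginHeader: string | null  // Access-Control-Allow-Origin from the response
): boolean {
  // Same-origin requests are never subject to CORS.
  if (pageOrigin === targetOrigin) {
    return true;
  }
  // Cross-origin without the header: the browser blocks the response.
  if (allowOriginHeader === null) {
    return false;
  }
  // Cross-origin: the header must be a wildcard or name the page's origin.
  return allowOriginHeader === "*" || allowOriginHeader === pageOrigin;
}
```

With the example origins from above, corsAllowsRead("https://first.com", "https://second.com", null) is false, which is the situation that shows up as a CORS error in the console.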
Developers often run into this problem when they're developing locally (http://localhost:8080) and are trying to connect to a backend on the test environment (http://somebackend.company.com). A common solution to that specific CORS problem is to use a proxy.
A proxy works because it sidesteps the browser's security mechanism. You run the proxy on the same origin as your frontend app (localhost), so the browser only ever sees same-origin requests and has no reason to raise CORS errors. The proxy then passes all requests through to the servers you want to access data from. This is fine, because the same-origin policy is enforced by browsers only: requests made by the proxy itself are server-to-server and are not filtered. A backend-for-frontend is another way to avoid CORS errors.
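The heart of such a development proxy is just a routing rule: which local paths get forwarded, and to where. Below is a minimal sketch of that rule only, not a complete proxy server. The /api prefix and the function name resolveProxyTarget are assumptions of mine; the backend URL is the example from above.

```typescript
// Example backend from the text above.
const BACKEND = "http://somebackend.company.com";
// Assumption: everything under /api on localhost is forwarded to the backend.
const PROXY_PREFIX = "/api";

// Hypothetical helper: returns the backend URL a local request path should be
// forwarded to, or null if the path should be served by the frontend itself.
function resolveProxyTarget(requestPath: string): string | null {
  const isProxied =
    requestPath === PROXY_PREFIX || requestPath.startsWith(PROXY_PREFIX + "/");
  if (!isProxied) {
    return null;
  }
  // Strip the prefix and keep the rest of the path (and query string) as-is.
  const rest = requestPath.slice(PROXY_PREFIX.length);
  return BACKEND + (rest || "/");
}
```

The frontend then calls http://localhost:8080/api/users, which is same-origin as far as the browser is concerned, and the proxy fetches http://somebackend.company.com/users on its behalf.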
But I digress. Back to our problem.
Okay, it's CORS. But why...?
A CORS problem that occurs on some machines, but not all, is very strange. You would expect a CORS problem to occur on ALL machines, or on NONE. So it might be related to how those machines are configured. Are there firewalls or proxies in place that modify requests? We had someone from workspace-maintenance research the situation. It took them a couple of days, and they concluded that the workspace was very unlikely to be the cause of the problem. The workspaces weren't identical, but the differences had nothing to do with networking. Another dead end and more days wasted.
We asked a lot of questions of the people from workspace-maintenance, the owners of the service that threw the CORS errors and the users themselves. We triple-checked the roles. We checked whether we might be talking to two server instances with different security rules. All of these paths were dead ends. So the only thing left to do was to visit the users again and do some debugging on-site.
The first thing we tried to find out was on which computers the bug surfaced and on which it didn't. Then we'd know which computers to debug on. (Also, if we had a list of computers WITH the bug and WITHOUT the bug, then maybe workspace-maintenance would have more accurate info for their research.) We located a couple of computers with the bug and a couple without it. But in the process, we found that one of the computers did NOT have the bug at first, but DID have it when we revisited it... This meant that our conclusion about specific computers having the bug was probably wrong. We tried recreating the bug on another computer that didn't have it at first and managed to get the bug to pop up, somehow. This ruled out the possibility that it was computer/workspace-related. It is always nice to be able to rule things out, but it left us with more questions. If it wasn't the workspace, then what could be the problem...?
After that, we tried to find a reproduction path: WHEN did this bug occur...? We had one of the users show us what they generally do when starting their workday. It included opening a lot of applications and browser tabs. Opening all of these seemed to trigger the bug. Great! We found a reproduction path!
... Or so we thought. The next step was to open the apps one by one to see if one of these apps triggered the bug. All of a sudden, the bug did not pop up anymore. Damn...
After messing around a lot, we found out that the bug was triggered whenever a user logged in and then opened ALL of their browser tabs at once. This ruled out that there was something wrong with the other desktop apps. Good.
After more messing around, we found out that the order of opening browser tabs was important. If we opened our app first and then all the other browser tabs, nothing went wrong. If we opened the other browser tabs first and then our app, it threw the CORS errors and broke.
Now that we knew that the order of opening was important, we were able to rule out browser tabs one by one. After opening and closing a lot of tabs we found the culprit! Another browser application (let's call it 'InterferingApp') turned out to be using the same resources as our app (let's call our app 'OurApp'). Meaning, they made requests to the exact same server with the exact same URL. But still... Why the CORS errors? Why did they interfere with one another?
It turned out that 'InterferingApp' was on the same origin as the server. Its requests were same-origin, so no CORS headers were involved in them at all. But still, why did that interfere? Because, as it turns out, the browser shares its cache across tabs and windows. That means that if we opened 'InterferingApp' first, we ended up with a cache full of responses WITHOUT the proper CORS headers.
Those headers are part of the cached response, and since 'OurApp' made requests to the same URL, the browser simply used the entries in the cache instead of asking the server again. The cached responses carried no permission for our origin, so the CORS check failed. Found it. Finally...
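The interference is easier to see in a toy model. The sketch below is a deliberately simplified stand-in for the browser cache, keyed on URL only (a real HTTP cache also looks at things like the Vary response header). The class, the URLs and the origin names are all made up for illustration.

```typescript
// A cached response, reduced to the one header that matters for this bug.
type CachedResponse = { body: string; allowOrigin: string | null };

// Toy cache keyed on URL only, shared by every tab (hypothetical model).
class UrlKeyedCache {
  private entries = new Map<string, CachedResponse>();

  // Returns the cached response for the URL, or loads and stores a fresh one.
  fetchThrough(url: string, loadFromServer: () => CachedResponse): CachedResponse {
    const hit = this.entries.get(url);
    if (hit !== undefined) {
      return hit;
    }
    const fresh = loadFromServer();
    this.entries.set(url, fresh);
    return fresh;
  }
}

const sharedCache = new UrlKeyedCache();
const sharedUrl = "https://shared-server.example/data"; // hypothetical URL

// 'InterferingApp' goes first. It is same-origin, so the response it
// receives (and caches) carries no Access-Control-Allow-Origin header.
sharedCache.fetchThrough(sharedUrl, () => ({ body: "data", allowOrigin: null }));

// 'OurApp' asks for the same URL. The server would have answered with a CORS
// header, but the cache hit means the server is never asked.
const seenByOurApp = sharedCache.fetchThrough(sharedUrl, () => ({
  body: "data",
  allowOrigin: "https://ourapp.example", // hypothetical origin of OurApp
}));

console.log(seenByOurApp.allowOrigin); // null
```

Swap the two calls and 'OurApp' gets its header, which matches what we saw: opening our app first was fine, opening it last broke it.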
So now we could finally figure out a solution. The problem was that the responses cached for 'InterferingApp' carried no CORS headers, and we did not have any kind of control over 'InterferingApp'. We needed to find a way to separate the cache entries for both apps. We learned that the browser caches responses based on URL (makes sense, right?). So we wrote a wrapper around the fetch API that checked all requests. If a request for the shared server was made, we'd add a query parameter to the URL. Something like '?cache=OurApp'. This way, the URL is different and the browser caches the responses separately. It is not a pretty solution, but it enabled us to be truly independent from 'InterferingApp'.
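For the curious, a wrapper along these lines could look roughly like the sketch below. It is a reconstruction of the idea, not our production code: the server origin is a placeholder, the function names are made up, and it assumes all requests use absolute URLs. Only the 'cache=OurApp' parameter comes from the description above.

```typescript
// Placeholder for the origin of the shared server (hypothetical).
const SHARED_SERVER_ORIGIN = "https://shared-server.example";
// Query parameter that makes our URLs differ from those of 'InterferingApp'.
const CACHE_PARAM = "cache";
const CACHE_VALUE = "OurApp";

// Adds the cache parameter to requests for the shared server and leaves all
// other URLs untouched. Assumes absolute URLs; new URL() throws on relative ones.
function withCacheKey(url: string): string {
  const parsed = new URL(url);
  if (parsed.origin !== SHARED_SERVER_ORIGIN) {
    return url;
  }
  parsed.searchParams.set(CACHE_PARAM, CACHE_VALUE);
  return parsed.toString();
}

// Wraps any fetch-like function so that every request URL passes through
// withCacheKey first.
function wrapFetch<Init, Result>(
  baseFetch: (url: string, init?: Init) => Result
): (url: string, init?: Init) => Result {
  return (url, init) => baseFetch(withCacheKey(url), init);
}

// In the browser this would be used as, for example:
// const ourFetch = wrapFetch((url: string, init?: RequestInit) => fetch(url, init));
```

A request for https://shared-server.example/data then goes out as https://shared-server.example/data?cache=OurApp, while requests to any other server are left alone. Taking the fetch function as an argument also makes the wrapper easy to test with a fake.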
It was quite a headache to get to the cause of the bug. In the end it took almost a month of on-and-off debugging to find it. But I learned a lot. I deepened my understanding of CORS and of how the browser works. I hope you learned something from my struggles as well.
Lastly, I want to thank my great colleague Jacob van Lingen for helping me out with this bug. We have been able to pump each other up when we needed some motivation.