Also for devices like the Manley Massive Passive EQ, the Thermionic The Phoenix, and similar?
I don't know. The problem with audio is that it has two aspects. There is the objective, scientific side, let's call it hard science, in which we can characterize the performance of a piece of audio equipment using test gear and figures of merit such as THD, IMD, SNR, frequency response, crosstalk, etc. If we used only those figures, we could determine which piece of equipment is better than another, no problem. But then there is taste, or subjectivity: people do not necessarily prefer the equipment with the lowest THD or IMD, SNR might be irrelevant, and price is an important psychological factor. In that case you have to leave hard science behind and move into polling listeners, using a double-blind test such as an A/B test.
The problem with the double-blind test is that people think it only involves two things:
1.- That the person administering the experiment, say, a comparison between two amplifiers (for instance the person switching between the two sources), should be unaware of which of the two is A and which is B.
2.- That the person participating in the experiment should not know in any way which of the two amplifiers is A or B.
That is correct, but I might also add that in any serious scientific study, such as one you could publish in a peer-reviewed journal, participants are required to take an audiometric test and their results must fall within specified tolerances. Also, the number of individuals must be a representative sample; just gathering five friends whose ears you "trust" is not a valid sample. However, let's say you do everything by the book. The problem with ITB-vs-OTB mix blind tests is that you have a multivariable problem: whereas in the "which amp sounds better?" test you have only one variable (the amp), in the ITB-vs-OTB test you have many of them, especially if the mixes are done separately, because if you use different gear in the analog chain and different plug-ins in the digital one, every one of those choices becomes a variable.
There are also false conclusions, for example what the guy did in the video the OP posted. If you take a digital mix, make a copy, distort the copy through a TEAC mixer, and then try to level-match the two "by ear" or by peak level, you cannot conclude that the TEAC version sounding better means analog is better than digital. What you have actually concluded is that a distorted mix sounds better than an undistorted one; at best, you have shown that the TEAC mixer's distortion sounds better than the clean digital version, not that analog beats digital. Plus, all the EQs and extra gear he added prove that he spent a lot of time tweaking the analog mix, which is far from impartial.
The closest approach to a scientific test is something Digidesign (now Avid) did years ago (2007 or 2008?), in which they asked some producers to mix songs of different genres using exclusively an SSL console, and then replicated the same knob positions in the plug-in emulation of that same console. They made a couple of mistakes, though:
1.- We all know that just because a knob points at the "+3dB" mark on the scale, it doesn't mean it is actually boosting 3dB: pots age and have tolerances, the taper of the pot might not give that exact boost at that position, and the knob itself might be offset relative to the scale. Digidesign knew this, so they said they adjusted the knob positions in the plug-in by ear, comparing against the console until the two sounded very similar. Here lies the problem: they introduced human bias and error. What they should have done is measure the output of each channel at each knob position and adjust the knob in the plug-in so it matched the console as closely as possible. This, of course, would have taken ages, but it was the proper way to do it.
2.- They told the public which mix was the plug-in version and which was the console version, which added extra bias.
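The measurement approach from point 1 is simple to sketch. Here's a hypothetical Python/NumPy illustration: the `console_channel` function is just a stand-in for a real console channel (in reality you'd capture its output through an interface), and the +2.7 dB figure is an invented example of pot error:

```python
import numpy as np

def measure_gain_db(device, freq, fs=44100, dur=1.0):
    """Feed a 1 V peak sine through `device` and report the actual
    gain in dB from the output/input RMS ratio."""
    t = np.arange(int(fs * dur)) / fs
    x = np.sin(2 * np.pi * freq * t)
    y = device(x)
    rms = lambda s: np.sqrt(np.mean(s ** 2))
    return 20 * np.log10(rms(y) / rms(x))

# Hypothetical console channel: the knob points at "+3dB" but the pot
# actually delivers +2.7 dB (aging, tolerance, taper, scale offset).
console_channel = lambda x: x * 10 ** (2.7 / 20)

print(round(measure_gain_db(console_channel, 1000), 2))  # 2.7, not 3.0
```

You would run this per channel and per knob position, then dial the plug-in until its measured gain matches; tedious, but no ears involved.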
Still, with all these flaws, the mixes were practically identical; there was only one in which there was a noticeable difference, and that could be attributed to human error or bias. In any case, even if the real console version had proved to sound better, that would only show that the plug-in is not a great emulation, not that analog triumphs over digital.
There are multiple fallacies when it comes to YouTube audio "scientists":
Another common mistake I see is the null test, which many love and use to death. They assume that if two things are the same, they must cancel each other out when one is subtracted from the other. This has a lot of problems. First, they are assuming there is such a thing as zero measurement uncertainty, something that doesn't exist in any field of science. Second, they assume there is no such thing as process variation, i.e. that equipment serial number 1 is 100% identical to equipment serial number 10569. Building on that fallacy, they assume that since the plug-in must have been modeled on the real equipment, and since all units must sound the same, you can take whatever unit you own, null-test it against the plug-in emulation, and the two must cancel perfectly, otherwise "the plug-in is no good".
There is of course another big caveat: for a null test to null perfectly, the two signals must not only be identical, they must also be perfectly time-aligned. There are many reasons why two signals might not be perfectly aligned, and on top of that, many people align them "by eye", nudging the audio one sample left or right until it matches the reference as well as possible. The problem is that the minimum resolution they can nudge is 1 sample. To put things into perspective, one sample at 44.1 kHz is 22.68 µs. Let's say the maximum alignment error is that 1 sample. If you try to null two 1 V (peak) 1 kHz sine waves that are off by 1 sample, the RMS of the residual is roughly 100 mV, or about -17.8 dBu, which is definitely audible. Let's go even further: if you try to null two 10 kHz sine waves that are off by those 22.68 µs, the RMS of the residual is 924 mV, or +1.5 dBu. Since a single 1 V, 10 kHz sine wave is -0.8 dBu, the residual of the two offset waves actually sounds 2.3 dB louder than one of the waves alone. Null test fail.
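Those residual figures are easy to reproduce. A small Python/NumPy sketch (0.7746 V RMS is the standard 0 dBu reference):

```python
import numpy as np

def residual_rms(freq, fs=44100, n=1 << 16, offset_samples=1):
    """RMS of the difference between a 1 V peak sine and a copy
    delayed by `offset_samples` (the best-case 'nudge by eye' error)."""
    t = np.arange(n) / fs
    a = np.sin(2 * np.pi * freq * t)
    b = np.sin(2 * np.pi * freq * (t - offset_samples / fs))
    return np.sqrt(np.mean((a - b) ** 2))

def dbu(v_rms):
    return 20 * np.log10(v_rms / 0.7746)  # 0 dBu = 0.7746 V RMS

print(residual_rms(1000), dbu(residual_rms(1000)))    # ~0.10 V, about -17.8 dBu
print(residual_rms(10000), dbu(residual_rms(10000)))  # ~0.92 V, about +1.5 dBu
```

Analytically the residual amplitude is 2·sin(π·f/fs) for a one-sample offset, which is why the error grows so quickly with frequency.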
Now, with sine waves it's easier to time-align by eye because they are very simple waveforms, but when you try to align complex waveforms like a full mix, things are not that simple, and the error is bound to be even greater. I can't tell you how many times I've read people on forums saying things like "I time-aligned the signals as best as I could, performed a null test, and I could definitely hear a difference". Wrong answer...
The trick to solving this is to perform the null test in the frequency domain rather than in the time domain. Take the FFT of each signal, then take the difference of their magnitudes, frequency by frequency, and you get the residual in the frequency domain. (It is also a good idea to apply a window to each time-domain signal first, something like a cosine-tapered (Tukey) window, but that's beyond the scope of what I'm trying to explain.) The big advantage of this method is that the two signals don't have to be time-aligned; they just need to be at the same amplitude, which can be achieved by computing the RMS level of each signal.
I ran an experiment. In MATLAB I created a 44.1 kHz, 65536-sample, 10 kHz, 1 V sine wave, which is roughly 1.49 s of audio, then took a copy of it and shifted it one sample to the right. If I do the null test in the time domain, the RMS voltage of the residual is 924 mV, or +1.5 dBu; but applying the FFT technique, I obtained a -306 dBu residual, which is zero difference for all practical purposes. Practical results with real recorded signals won't be as good, obviously, but you get the idea...
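For anyone who wants to try this without MATLAB, here is a Python/NumPy version of the same experiment. I'm assuming a circular shift for the one-sample offset; a pure delay only rotates the phase of each FFT bin, so the magnitude spectra should match down to floating-point noise:

```python
import numpy as np

FS, N = 44100, 65536
t = np.arange(N) / FS
a = np.sin(2 * np.pi * 10000 * t)  # 10 kHz, 1 V peak, ~1.49 s
b = np.roll(a, 1)                  # copy shifted one sample (circular)

def dbu(v_rms):                    # 0 dBu reference = 0.7746 V RMS
    return 20 * np.log10(v_rms / 0.7746)

# Time-domain null: subtract and take the RMS of what's left.
td = np.sqrt(np.mean((a - b) ** 2))

# Frequency-domain null: subtract the magnitude spectra bin by bin;
# by Parseval, ||A - B|| / N is the equivalent time-domain RMS.
A, B = np.abs(np.fft.fft(a)), np.abs(np.fft.fft(b))
fd = np.linalg.norm(A - B) / N

print(dbu(td))  # around +1.5 dBu
print(dbu(fd))  # hundreds of dB down: numerical noise
```

The exact frequency-domain figure depends on FFT library and precision, but it lands so far below any noise floor that it reads as a perfect null either way.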