And what context would that be? I certainly wouldn't use it for high gain or where low noise (LN) was important.
It does incredibly poorly with lowish loads. My standard headphones are the Fostex T50RP, magnetic planar headphones aimed at studio use and in continuous production since the 1980s.
These have ~50 Ohm resistive impedance and ~100 dB/V sensitivity, and they are much more transparent than any dynamic-driver headphones I have experience with.
After small modifications (mainly adding damping material to the rear cavity and reinforcing the cheap, thin plastic of the whole acoustic system) they come close to the transparency of STAX electrostatic headphones.
By modern standards the T50RP is not "hard" to drive (that honour goes to DCA's Aeon), but it is a much more difficult load than most dynamic headphones and much more of a workout for a headphone amplifier than common dynamic types.
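To put rough numbers on "a workout", here is a quick back-of-envelope sketch using the figures above; the 110 dB SPL peak target is purely my assumption for illustration.

```python
# Back-of-envelope drive requirements for a T50RP-like load, using the
# ~50 Ohm / ~100 dB/V figures quoted above. The 110 dB SPL peak target
# is an assumption for illustration only.
sensitivity_db_per_v = 100.0   # SPL produced at 1 V RMS
impedance_ohm = 50.0           # roughly resistive across the audio band
target_spl_db = 110.0          # assumed peak listening level

v_rms = 10 ** ((target_spl_db - sensitivity_db_per_v) / 20)  # volts RMS
i_rms = v_rms / impedance_ohm                                # amps RMS
p_mw = 1000 * v_rms ** 2 / impedance_ohm                     # milliwatts

print(f"{v_rms:.2f} V RMS, {1000 * i_rms:.0f} mA RMS, {p_mw:.0f} mW")
# -> 3.16 V RMS, 63 mA RMS, 200 mW
```

Around 63 mA RMS into a near-resistive 50 Ohm load is far more output current than a typical 300 Ohm dynamic headphone draws at the same voltage, which is exactly where many headphone outputs run out of steam.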
Genuine question as I've done a lot of DBLTs of OPAs in Jurassic Times.
I am unsure what listening setup you used, what statistical analysis you applied (ABX, for example, is worse than useless, cargo-cult science) and what variable you were testing for, so I cannot comment.
In blind listening tests with fairly large numbers of subjects I was able to observe a strong preference for one specific OPA over others, despite the THD+N of all op-amps being well below that of the source and in "non-stressful" conditions.
The same preference held when artificial stresses were introduced.
Over time at AMR/iFi we did many systematic listening tests (using preference, via a somewhat complex questionnaire) to find out whether, among nominally similar items, there was an observable and reliable preference for almost anything, from resistors (MELF thin film wins among all commodity options by a huge margin, which surprised me) to active parts etc.
Some outcomes surprised me so much that I did re-runs to confirm that the results were reliable.
Note, my tests were always blind in the sense that listeners not only did not know the identities of the DUTs, they also did not know what variable was actually being tested. The whole setup was tightly controlled for any possible biases, unlike the common "DB" tests widely promoted.
If the listener knows what is being tested and has any opinion or prejudice on the matter, any DB test is impossible (placebo/nocebo effect).
Forced-choice tests tend to create the equivalent of "examination stress".
On the other hand, if I give you a permanent reference device that is arbitrarily assigned a score of 5 in each category (e.g. quality of bass, mid, treble, dimensions of the virtual sound space, emotional engagement etc.), and I ask you to listen to multiple alternative devices (blind), mark them for perceived quality AND at the end give an overall preference mark (which one would you rather take home), then I have changed the task and actually "gamified" the experience.
That necessarily leads to radically different outcomes from common DB listening tests, which tend to be forced-choice difference detection, and which I find mostly useless anyway, as they give me no actionable intelligence.
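For concreteness, a minimal sketch of how such a scoring scheme might be tabulated; the category names, device labels, scores and votes are all hypothetical, purely to show the mechanics.

```python
from statistics import mean

CATEGORIES = ["bass", "mid", "treble", "soundstage", "engagement"]
REFERENCE_SCORE = 5  # the permanent reference is pinned at 5 in every category

# Blind per-category ratings: listener -> device -> scores (hypothetical data)
ratings = {
    "listener_1": {"DUT_A": [6, 5, 4, 6, 7], "DUT_B": [4, 5, 5, 3, 4]},
    "listener_2": {"DUT_A": [5, 6, 5, 6, 6], "DUT_B": [5, 4, 4, 4, 3]},
}
take_home = {"listener_1": "DUT_A", "listener_2": "DUT_A"}  # overall preference

for dut in ("DUT_A", "DUT_B"):
    per_cat = [mean(r[dut][i] for r in ratings.values())
               for i in range(len(CATEGORIES))]
    delta = mean(per_cat) - REFERENCE_SCORE  # positive = rated above reference
    votes = sum(1 for choice in take_home.values() if choice == dut)
    print(dut, dict(zip(CATEGORIES, per_cat)),
          f"vs reference: {delta:+.2f}, take-home votes: {votes}")
```

The point is that every trial yields graded, per-category data relative to a fixed anchor, rather than a single same/different verdict.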
Add to that the incredibly poor statistics applied to the most popular format (ABX), where a preponderance of failures to reject the null hypothesis is designed in, and naturally these tests return null results even on the strongest and easiest-to-detect audible differences.
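To illustrate the power problem with my own numbers (not a claim about any specific published test): with a typical short run and the usual 5% significance criterion, even a listener who genuinely hears the difference most of the time will usually "fail".

```python
from scipy.stats import binom

# Hypothetical ABX run: 16 trials, alpha = 0.05, and a listener who picks
# correctly 70% of the time. All three numbers are assumptions for
# illustration.
n, alpha, p_listener = 16, 0.05, 0.70

# Smallest number of correct answers that rejects "just guessing" (p = 0.5)
k_crit = min(k for k in range(n + 1) if binom.sf(k - 1, n, 0.5) < alpha)

# Probability that the 70%-correct listener actually reaches that criterion
power = binom.sf(k_crit - 1, n, p_listener)

print(f"need {k_crit}/{n} correct; power = {power:.2f}")
# -> need 12/16 correct; power = 0.45
```

Under these assumptions the test returns a null result more often than not, even though the audible difference is real; that is the designed-in bias I am referring to.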
I don't want this to turn into another "ABX is the gold standard" vs. "ABX is pseudoscience and intentionally biased towards null results" debate, which is OT in this thread.
If you want to debate DB testing, we should do so in a separate thread. You are welcome to open one.
Thor