Amir and Blind Testing


Let me start by saying I like watching Amir from ASR, so please let’s not get harsh or the thread will be deleted. Amir has often noted that when we insert a new component into our system, our brains go into (to paraphrase) “analytical mode” and we start hearing imaginary improvements. He has said that when he switched to an expensive cable he heard improvements, but when he switched back to the cheap one he also heard improvements, because the brain switches from “music enjoyment mode” to “analytical mode.” Following this logic, which I agree with, wouldn’t blind testing, or any A/B testing, be compromised because our brains are always in analytical mode during the test and are therefore feeding us inaccurate data? It seems to me you need to relax for at least a few hours and listen to a variety of music before your brain can accurately assess whether something is an actual improvement. Perhaps A/B testing is a strawman argument, because the human brain is not a spectrum analyzer. We are too affected by our biases to come up with any valid data. Maybe.

chayro

I read somewhere that Amir is a famous Egyptian movie star who grows prize-winning watermelons in his spare time. It's a confirmed fact.

This anecdote has merit as a credible and verifiable snippet of gossip (for the erudite, technically known as hearsay).

I remember reading somewhere (and now I forget where) that A/B testing relies on our short-term memory, which isn't the best method.

Not sure if that's true or not, but I felt it was an interesting point. But that would at least explain why a lot of people like to take their time before making decisions on whether something sounds better/different/worse.

Amir and others on this thread are absolutely right: A/B comparisons are notoriously flawed by expectation bias; that's just how our brains work. In my profession (drug discovery) we therefore use "double-blind" evaluations, where neither the experimenter (e.g. the audio dealer) nor the patient (i.e. the customer) knows whether the patient is receiving a new treatment, a standard treatment, or (in non-critical cases) a sugar pill (i.e. placebo). Only such an evaluation would either confirm or call into question Amir's well-intended measurements, that is, whether the data he measures are relevant to human musical enjoyment and would therefore indicate, before you buy it, whether a particular piece of gear enhances or diminishes that pleasure (which is, I suppose, what this exercise should be all about). The respective measurements would have to track the "enjoyment score" given after listening to a hidden piece of gear, new or old, with neither the listener nor the dealer knowing which gear is playing. In that sense, Harry Pearson was correct in his criticism of sole reliance on either measurements or A/B comparisons. He also knew that a new piece of equipment might sound spectacular at the onset, only to become fatiguing after a few hours or even days, no matter how "good" the measured data were. Psychoacoustics was a budding discipline in his early days, and we are still just beginning to understand how we make esthetic decisions, and what part THD plays in this puzzle, if any at all.

I remember reading somewhere (and now I forget where) that A/B testing relies on our short-term memory, which isn't the best method.

Relying on short-term memory could prove problematic for certain individuals in some demographics.

Would you mind repeating your question?  I had to think about it.

 

He also knew that a new piece of equipment might sound spectacular at the onset, only to become fatiguing after a few hours or even days, no matter how "good" the measured data were.

This fatigue aspect is an issue with audio. Not just with audio, but I digress.

This is where describing measurements as good, bad, or anything else is incorrect. They are data.

What can be read into the data matters.

The characteristics of amplification which contribute to fatigue may be measured and therefore predicted.

You mention THD amongst other things - yes, and some aspects are pleasing, and others are grating to the brain. (And some serve to mask certain issues in the recording process, but that’s another topic).

This is perhaps one reason why measurements are preferred over blind testing.

As for blind testing, however messy it is even at the best of times, I would lower the threshold and leave the enjoyment or pleasing factor out of it. "Does it sound different?" is a more realistic objective.
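For what it's worth, a "does it sound different?" test can be scored very simply. Below is a rough sketch, assuming an ABX-style run where the listener has to say on each trial whether the hidden unit X is A or B, and pure guessing is right half the time. The function name `abx_p_value` and the trial counts are my own illustration, not anything Amir or ASR publishes.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` answers right out of `trials`
    by pure guessing (each trial a 50/50 coin flip).

    This is the one-sided exact binomial tail probability. A small value
    suggests the listener really did hear a difference; it says nothing
    about which unit sounds better or more enjoyable.
    """
    if trials <= 0 or not 0 <= correct <= trials:
        raise ValueError("need trials > 0 and 0 <= correct <= trials")
    # Count all outcomes with `correct` or more hits, divide by all 2**trials outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


if __name__ == "__main__":
    # Hypothetical session: 12 correct identifications in 16 trials.
    print(f"p = {abx_p_value(12, 16):.3f}")
```

With 12 right out of 16 the chance of doing that well by guessing is about 0.038, while 8 out of 16 is exactly what guessing looks like. Note this only answers "different or not"; it does not touch the short-term memory or listening fatigue concerns raised above.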

Measurements provided by Amir indicate to me which bits of gear I may or may not enjoy owning. Others have different preferences. The data in itself is neither good nor bad - it is information.