Why Do So Many Audiophiles Reject Blind Testing Of Audio Components?


Because it was scientifically proven to be useless more than 60 years ago.

A speech scientist by the name of Irwin Pollack conducted an experiment in the early 1950s. In a blind ABX listening test, he asked people to distinguish minimal pairs of consonants (like “r” and “l”, or “t” and “p”).

He found that listeners had no problem telling these consonants apart when they were played back immediately one after the other. But as he increased the pause between the playbacks, the listeners’ ability to distinguish between them diminished. Once the time separating the sounds exceeded 10-15 milliseconds (roughly 1/100th of a second), people had a really hard time telling obviously different sounds apart. Their answers became statistically no better than a random guess.

If you are interested in the science of these things, here’s a nice summary:

Categorical and noncategorical modes of speech perception along the voicing continuum

Since then, the experiment has been repeated many times (the last major update was in 2000: Reliability of a dichotic consonant-vowel pairs task using an ABX procedure).

So reliably recognizing the difference between similar sounds in an ABX environment is impossible. With a 15 ms playback gap, the listener's guess becomes no better than random. This happens because humans don't have any meaningful waveform memory. We cannot recall the sound itself exactly; we rely on various mental models for comparison. It takes time and effort to develop these models, which makes us really bad at playing the "spot the sonic difference here and now" game.
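
(For the statistically inclined: "no better than a random guess" in an ABX run is usually checked with a binomial test against a 50% hit rate. Here is a minimal Python sketch; the trial counts are made up for illustration.)

```python
from scipy.stats import binomtest

# Hypothetical ABX session: 16 trials, 9 correct identifications of X.
n_trials, n_correct = 16, 9

# Under the null hypothesis the listener is guessing, so P(correct) = 0.5.
# A one-sided binomial test asks whether the hit rate is reliably above chance.
result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"{n_correct}/{n_trials} correct, p = {result.pvalue:.2f}")
# 9/16 correct gives p ≈ 0.40, i.e. statistically indistinguishable from coin-flipping.
```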

Also, please note that the experimenters were using the sounds of speech. Human ears have significantly better resolution and discrimination in the speech spectrum. If a comparison method does not work well with speech, it will not work at all with music.

So the “double blind testing” crowd is worshiping an ABX protocol that was scientifically proven more than 60 years ago to be completely unsuitable for telling similar sounds apart. And they insist all the other methods are “unscientific.”

The irony seems to be lost on them.

Why do so many audiophiles reject blind testing of audio components? - Quora
artemus_5

Showing 5 responses by jakleiss

I'll jump in and add my point of view. I'm a cognitive psychologist and I also appreciate good audio. As a psychologist I have worked most of my professional life quantifying people's perceptions of products using various psychometric techniques. I correlate perceptions with physical features of products to inform designers, engineers, and marketing specialists what product features are most closely aligned with the desired experience for their intended market. Most of my research is "blind" in that my subjects do not know the origin of any particular "stimulus" (product) that they experience, and this is done to reduce bias from unintended influences such as brand identity or product features that are not relevant to the research.

There are some important additional points I wanted to make for audio. Sensory scientists have found it necessary to distinguish between two types of perception. First is the veridical description of a sensory stimulus, i.e., its visual, auditory, tactile, gustatory, or olfactory qualities. This strikes me as being similar to what most audiophiles strive for. It is noteworthy that such a so-called descriptive analysis is left in the sensory sciences to trained experts. The image of a wine connoisseur may come to mind. The reason is that most "average" people lack both the sensitivity to detect subtle physical properties of products and the vocabulary to reliably describe them.

Most companies take their typical consumers into account with a second type of perceptual measurement, which describes the subjective perceptual experience (i.e., feelings and emotions) of their customers. In this case, a sample of people is required because the results are statistical estimates based on that sample of perceptual judgements. But, consumers are asked very different questions than experts, such as "do you like it?" or "is it pleasing?". In my own research I rely on the psychological theory of Semantic Differentials, which describes three underlying psychological dimensions of experience: a) Valence (like/don't like), b) Strength (strong/delicate), and c) Arousal (stimulating/relaxing). In addition, I find a fourth Semantic dimension of Novelty (familiar/unfamiliar) is required to describe the full perceptual experience of actual consumers. These psychological dimensions are common to all humans (accounting for differences in language) and are bipolar in nature. That is, for each dimension experience ranges between two polar opposite extremes with a "neutral" point in the middle.
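
To make the bipolar scales concrete, here is a minimal Python sketch of how such ratings might be aggregated into a per-product profile. The 7-point coding, the dimension labels, and the numbers are illustrative assumptions rather than the actual instrument.

```python
import numpy as np

# Hypothetical 7-point bipolar scales coded -3..+3, with 0 as the neutral point.
# Rows are participants; columns are the four Semantic dimensions.
DIMENSIONS = ["Valence", "Strength", "Arousal", "Novelty"]

ratings = np.array([
    [ 2,  1, -1,  0],   # participant 1
    [ 3,  0, -2,  1],   # participant 2
    [ 1,  2,  0, -1],   # participant 3
])

# The product's Semantic profile is the mean on each dimension, with a
# standard error to show how consistent the sample's judgements are.
profile = ratings.mean(axis=0)
sem = ratings.std(axis=0, ddof=1) / np.sqrt(ratings.shape[0])

for name, m, e in zip(DIMENSIONS, profile, sem):
    print(f"{name:8s} {m:+.2f} ± {e:.2f}")
```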

Importantly, only Valence has an obviously preferred polarity, i.e., "don't like" is always a bad thing. The other three Semantic dimensions may range anywhere between the bipolar extremes depending upon one's design goals. So, how do you know what is the best product? A similar analysis of a comparison stimulus may serve that purpose and that seems similar to what is often described in the audio community. But, an imagined "ideal" experience may also be used. I have had subjects in my research imagine their "ideal" product and rate it prior to experiencing the actual products and that provides a target experience profile for actual products. Differences between target and actual products may then be statistically compared within each Semantic dimension.
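
As a rough sketch of that target-versus-actual comparison, a paired test within one dimension could look like the Python snippet below; the participants and ratings are invented for illustration, and the actual analysis may well be more elaborate.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical ratings from the same 5 participants on one Semantic dimension
# (say, Arousal): first for their imagined "ideal" product, then for an actual one.
ideal  = np.array([ 1,  2,  1,  0,  2])
actual = np.array([-1,  0,  1, -2,  0])

# A paired t-test asks whether the actual product reliably misses the
# target experience on this dimension; repeat per dimension.
t, p = ttest_rel(actual, ideal)
print(f"mean gap = {(actual - ideal).mean():+.2f}, t = {t:.2f}, p = {p:.3f}")
```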

An important consequence of the multidimensional approach is that two or more products may be both similar to one another with respect to one Semantic dimension and different from one another with respect to other Semantic dimensions. This might explain the seemingly never-ending debate about whether audio systems are different or not. The answer may be that they are both similar and different depending upon which Semantic dimension of experience you attend to. In my research, I always report the entire profile of Semantic scores for each product so that similarities and differences may be directly compared. 

One last point (finally!) is that different physical properties of products correlate with scores on each of the four Semantic dimensions. This provides actionable information to fine tune a product to a particular desired level of Semantic experience. I have done a good bit of this sort of thing in my career including with acoustics. Specific acoustic requirements for a particular Semantic profile can be obtained by correlating various acoustic metrics (or expert judgements) with the Semantic scores for a sample of consumers. I have used similar methods to define requirements for visual and tactile qualities of products as well.
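
For a sense of how such acoustic-to-Semantic correlations work, here is a minimal Python sketch; the metric (loudness in sones) and all the numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data for 8 products: one acoustic metric and the mean
# Arousal score each product received from a consumer sample.
loudness_sones = np.array([4.1, 5.0, 6.2, 7.5, 8.0, 9.1, 10.4, 11.2])
arousal_score  = np.array([-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3, 1.6])

# The correlation tells designers how strongly this metric tracks the
# perceptual dimension, and its sign tells them which way to push it.
r, p = pearsonr(loudness_sones, arousal_score)
print(f"r = {r:.2f}, p = {p:.4f}")
```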

Thank you, dletch2, for your post. There is much subtlety buried in these questions of perception. For instance, what is the difference between preference and liking? Liking (the "good" pole of the bipolar Semantic dimension of Valence) is always a "good" thing (pun intended). But, what if two products are equally well liked (i.e., equally "good") AND different in other apparent Semantic qualities such as strength, arousal, or novelty? Knowing which one is preferred helps pick one. But, unless you measure these other Semantic dimensions you don't know why it is preferred. I come at this indirectly by having my subjects rate their imagined "ideal" product, which gives me target values on each Semantic dimension to shoot for. But, I have to assume that my subjects are familiar enough with the products to know what "ideal" looks, sounds, feels, tastes, or smells like for them. In practice, this does not seem to have been a problem.

I will also add that my interest is not in knowing whether people can detect a difference between products, but where in the multidimensional perceptual space any given product lies. Products that are close to one another in perceptual space are more similar to one another than products that are farther apart. Here, the words of my graduate advisor, Dr. David Lane, come to mind: statistics can tell you whether two stimuli are reliably different, but they can't tell you whether that difference makes a difference to your target audience. There, you have to know something about preference.
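
As a toy illustration of "distance in perceptual space", one can compare two Semantic profiles directly; the Euclidean metric and the numbers below are assumptions made for the sketch, not a claim about the right distance measure.

```python
import numpy as np

# Hypothetical Semantic profiles (Valence, Strength, Arousal, Novelty)
# for two products, on the same -3..+3 scale as above.
product_a = np.array([ 2.0,  1.0, -0.5,  0.5])
product_b = np.array([ 1.8, -1.5,  1.0,  0.4])

# Overall distance: products that are close are perceived as similar overall.
print(f"perceptual distance = {np.linalg.norm(product_a - product_b):.2f}")

# Per-dimension gaps show where any difference lives.
gaps = dict(zip(["Valence", "Strength", "Arousal", "Novelty"],
                np.round(product_a - product_b, 2)))
print(gaps)
```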

As for fatigue, my subjects evaluate products one at a time. So, they give each product their full attention. But, I evaluate many products in the same study (typically 15 to more than 30), so there is a possible element of fatigue. I handle that with counterbalancing. Across subjects, each product appears equally often at the beginning, in the middle, and at the end of the presentation sequence.
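
A minimal sketch of that kind of counterbalancing is a simple rotation, in which each product appears once in every serial position across a block of participants; my actual designs may differ, and the product labels here are placeholders.

```python
# Cyclic rotation: across a block of as many participants as there are
# products, each product appears exactly once in each serial position.
def rotated_orders(products):
    n = len(products)
    return [[products[(start + i) % n] for i in range(n)] for start in range(n)]

for order in rotated_orders(["A", "B", "C", "D"]):
    print(order)
# Every product shows up exactly once in positions 1 through 4.
```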

Again, thanks for your comments, dletch2! They make me think.

Me again. I commented above on the psychometric technique of Semantic Differential rating scales, which is the primary technique I use in assessing perception because it is easy for participants to use and it differentiates products along multiple dimensions simultaneously. However, I have also used another psychometric technique, which is based upon similarity judgements. It strikes me that this touches upon the issue of AB and ABX testing. I collect similarity data either by having participants in my studies rate the degree of perceived similarity between pairs of items, or with a "triad" method in which items are presented three at a time and the participant selects the one that seems most different. Each triad provides input to three cells of a (dis)similarity matrix. Cells for the two items that weren't chosen as being "different" are coded "0" for minimum dissimilarity and cells for the two pairings involving the "different" item are coded "1" for maximum dissimilarity. I sum the matrices for individual participants and analyze them using Multidimensional Scaling to map all items within a multidimensional perceptual space.
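
To make the triad coding and scaling concrete, here is a small Python sketch using scikit-learn's MDS; the items, the judgements, and the two-dimensional solution are all invented for illustration.

```python
import numpy as np
from sklearn.manifold import MDS

items = ["A", "B", "C", "D"]
index = {name: i for i, name in enumerate(items)}

# Hypothetical triad judgements: (item1, item2, item3, odd_one_out).
triads = [
    ("A", "B", "C", "C"),
    ("A", "B", "D", "D"),
    ("A", "C", "D", "A"),
    ("B", "C", "D", "B"),
]

# Code each triad into the dissimilarity matrix: the two pairings involving
# the "odd" item each get 1, the remaining pair gets 0, then sum over triads.
dissim = np.zeros((len(items), len(items)))
for a, b, c, odd in triads:
    for x in (a, b, c):
        if x != odd:
            dissim[index[x], index[odd]] += 1
            dissim[index[odd], index[x]] += 1

# Map the items into a two-dimensional perceptual space with metric MDS.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)
for name, (x, y) in zip(items, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
# With these judgements, A and B land close together, as do C and D.
```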

The rating method provides what is essentially a continuous metric for the AB comparison, and the triad method provides something that is functionally similar to ABX, except with three different items and the question reversed: not "which two are the same?" but "which one is most different?" The two items remaining after the odd-man-out is removed become the "same" items in ABX.

Although many psychometricians assume that the similarity judgments and Semantic Differential ratings produce equivalent results, I find that is usually not the case. In my experience, the Semantic Differentials (using Factor Analysis) differentiate a larger number of independent perceptual dimensions whereas the similarity comparisons (using Multidimensional Scaling) sometimes reveal higher order perceptual qualities that are not easily described with words. Either way, the emphasis is on identifying the number of different ways that things vary. Perception is always multidimensional unless you intentionally restrict the range of variation. Ironically, that is exactly what happens when you compare only two or a small number of items.

Excuse me, I meant to write above "..blind testing is NOT a test methodology...".

djones51, and others, I wouldn't say that blind testing is a test methodology, but, rather, an experimental design control for confounding variables. Another option is to include those variables in the analysis and test for their effects directly. I sometimes do this with demographic factors such as participants' age, gender, level of education, cultural identity and such. But, all of this aligns with the multivariate research methodologies I like to use, which correlate information across multiple products and participants. Comparing only two things necessarily confounds any potentially relevant variables.
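
As a sketch of what "including those variables in the analysis" can look like, the Python snippet below fits a simple regression with a demographic covariate alongside the product factor; the variables and data are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: Valence ratings of two products plus one demographic
# covariate (age) for a handful of participants.
df = pd.DataFrame({
    "product": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "age":     [24,  31,  45,  60,  28,  52,  39,  22],
    "valence": [ 2,   1,   0,   1,  -1,   1,   0,  -2],
})

# Rather than only blinding a confound away, model it: the product effect is
# estimated while the age effect is tested directly in the same model.
model = smf.ols("valence ~ C(product) + age", data=df).fit()
print(model.params)
print(model.pvalues)
```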