Why Do So Many Audiophiles Reject Blind Testing Of Audio Components?
Because it was scientifically proven to be useless more than 60 years ago.
A speech scientist by the name of Irwin Pollack conducted an experiment in the early 1950s. In a blind ABX listening test, he asked people to distinguish minimal pairs of consonants (like “r” and “l”, or “t” and “p”).
He found that listeners had no problem telling these consonants apart when they were played back immediately one after the other. But as he increased the pause between the playbacks, the listeners’ ability to distinguish between them diminished. Once the time separating the sounds exceeded 10-15 milliseconds (roughly 1/100th of a second), people had a really hard time telling obviously different sounds apart. Their answers became statistically no better than a random guess.
If you are interested in the science of these things, here’s a nice summary:
So reliably recognizing the difference between similar sounds in an ABX environment is impossible. With a playback gap of just 15 ms, the listener’s guesses become no better than random. This happens because humans don't have any meaningful waveform memory. We cannot exactly recall the sound itself, and rely on various mental models for comparison. It takes time and effort to develop these models, thus making us really bad at playing the "spot the sonic difference right here and now" game.
Also, please note that the experimenters were using the sounds of speech. Human ears have significantly better resolution and discrimination in the speech spectrum. If a comparison method does not work well with speech, it would not work at all with music.
So the “double blind testing” crowd is worshiping an ABX protocol that was scientifically proven more than 60 years ago to be completely unsuitable for telling similar sounds apart. And they insist all the other methods are “unscientific.”
The study you linked to, and seem to have based some of your assertions on, might be flawed by design. At least to half of the Audiogon crowd...
"All of the stimuli were digitized and their waveforms were stored on the Pulse Code Modulation System..."
Are we going to accept digital as a reliable method? Here?
Additionally, subjects in the testing...
"All were right-handed native speakers of English..."
How would the results be for some other native speakers? It would have been more interesting had they included multiple groups of subjects. Maybe they would have found that native Korean speakers are much better at picking out differences than English ones. Even then, digital? As proof on Audiogon?
"Because it was scientifically proven to be useless more than 60 years ago."
Interestingly enough, the abstract of that article (I could not find the whole article) starts with an emphasis on "the extremely acute sensitivity of a human listener to discriminate small differences". The rest of that abstract leaves a lot to be desired.
"In contrast to the extremely acute sensitivity of a human listener to discriminate small differences in the frequency or intensity between two sounds is his relative inability to identify (and name) sounds presented individually."
"A speech scientist by the name of Irwin Pollack have conducted an experiment in the early 1950s."
Which one of Dr. Pollack’s studies are you referring to? The two from the 1950s do not seem to support the assertion that "blind testing" is useless. The one from 1971 (https://link.springer.com/content/pdf/10.3758/BF03329012.pdf), even less so. Are you sure it was Dr. Pollack’s work you were referring to? I did not go through all the references in the article you provided a link to, just those by Dr. Pollack.
I think this was actually an older thread, and they just changed the names of the OP and the responders so it looks new...I think they do that often for certain topics...
I agree with @Erik_Squires' sentiment: "Double blind studies are for product development, not for making a choice in buying." Does anyone believe that one cannot discern very large differences (e.g., orders of magnitude) in music reproduction? No. Very subtle differences, that is a different story. There is an interesting Darko.Audio podcast from 19 April in which Jonathan Novick of Audio Precision is interviewed. Mr. Novick explains how "super listeners" were selected from average listeners; only a fraction of the entire interview was devoted to this. The degree to which these super listeners could discern minor changes within a narrow frequency band, and identify which band, is very interesting! I would wager that many of you posting would qualify as super listeners. A-B testing, blinded or no, performed by an experienced reviewer with discerning ears provides a possibly useful data point guiding the consumer on their quest, nothing more. I reject the OP's thesis statement as it pertains to the average audiophile listening at home for the following reasons: truly blinded tests are 1) very difficult, 2) very time consuming, and 3) probably expensive to set up. And no one else cares about the end product!
"A-B testing, blinded or no, performed by an experienced reviewer with discerning ears provides a possibly useful data point guiding the consumer on their quest, nothing more."
I would add "...performed by an honest and experienced reviewer...", meaning "with no conflict of interest of any sort".
"...as only 1 person is making a claim."
Not around here.
You may need to hire a scheduler.
The line is longer than in front of the Nike store on the day of a sneaker release.
I'll jump in and add my point of view. I'm a cognitive psychologist and, also, appreciate good audio. As a psychologist I have worked most of my professional life quantifying people's perceptions of products using various psychometric techniques. I correlate perceptions with physical features of products to inform designers, engineers, and marketing specialists what product features are most closely aligned with the desired experience for their intended market. Most of my research is "blind" in that my subjects do not know the origin of any particular "stimulus" (product) that they experience and this is done to reduce bias from unintended influences such as brand identity or product features that are not relevant to the research.
There are some important additional points I wanted to make for audio. Sensory scientists have found it necessary to distinguish between two types of perception. First is the veridical description of a sensory stimulus, i.e., its visual, auditory, tactile, gustatory, or olfactory qualities. This strikes me as being similar to what most audiophiles strive for. It is noteworthy that such a so-called descriptive analysis is left in the sensory sciences to trained experts. The image of a wine connoisseur may come to mind. The reason is that most "average" people lack both the sensitivity to detect subtle physical properties of products and the vocabulary to reliably describe them.
Most companies take their typical consumers into account with a second type of perceptual measurement, which describes the subjective perceptual experience (i.e., feelings and emotions) of their customers. In this case, a sample of people is required because results are based upon statistical estimates of a sample of perceptual judgements. But consumers are asked very different questions than experts, such as "do you like it?" or "is it pleasing?". In my own research I rely on the psychological theory of Semantic Differentials, which describes three underlying psychological dimensions of experience: a) Valence (like/don't like), b) Strength (strong/delicate), and c) Arousal (stimulating/relaxing). In addition, I find a fourth Semantic dimension of Novelty (familiar/unfamiliar) is required to describe the full perceptual experience of actual consumers. These psychological dimensions are common to all humans (accounting for differences in language) and are bipolar in nature. That is, for each dimension experience ranges between two polar opposite extremes with a "neutral" point in the middle.
Importantly, only Valence has an obviously preferred polarity, i.e., "don't like" is always a bad thing. The other three Semantic dimensions may range anywhere between the bipolar extremes depending upon one's design goals. So, how do you know what is the best product? A similar analysis of a comparison stimulus may serve that purpose and that seems similar to what is often described in the audio community. But, an imagined "ideal" experience may also be used. I have had subjects in my research imagine their "ideal" product and rate it prior to experiencing the actual products and that provides a target experience profile for actual products. Differences between target and actual products may then be statistically compared within each Semantic dimension.
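Purely as an illustration of that last step, here is a minimal Python sketch comparing an imagined "ideal" profile against an actual product within each Semantic dimension. The subjects, the 7-point ratings, and the use of a paired t-test per dimension are all assumptions made for the example, not a description of my actual procedure.

```python
# Illustrative only: compare "ideal" vs. "actual" ratings per Semantic dimension.
import numpy as np
from scipy.stats import ttest_rel

dimensions = ["Valence", "Strength", "Arousal", "Novelty"]

# Rows = subjects, columns = the four Semantic dimensions (7-point bipolar
# scale, 4 = neutral). All numbers are invented for the example.
ideal = np.array([[6, 5, 3, 4],
                  [7, 4, 4, 5],
                  [6, 5, 2, 4],
                  [7, 6, 3, 3],
                  [6, 4, 3, 5]])
actual = np.array([[5, 6, 4, 3],
                   [6, 6, 5, 3],
                   [5, 5, 4, 2],
                   [6, 7, 4, 3],
                   [6, 6, 5, 2]])

# Paired comparison within each dimension: how far does the actual product
# deviate from the target ("ideal") experience profile?
for i, name in enumerate(dimensions):
    gap = np.mean(actual[:, i] - ideal[:, i])
    t, p = ttest_rel(actual[:, i], ideal[:, i])
    print(f"{name:8s} mean gap = {gap:+.2f}, p = {p:.3f}")
```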
An important consequence of the multidimensional approach is that two or more products may be both similar to one another with respect to one Semantic dimension and different from one another with respect to other Semantic dimensions. This might explain the seemingly never-ending debate about whether audio systems are different or not. The answer may be that they are both similar and different depending upon which Semantic dimension of experience you attend to. In my research, I always report the entire profile of Semantic scores for each product so that similarities and differences may be directly compared.
One last point (finally!) is that different physical properties of products correlate with scores on each of the four Semantic dimensions. This provides actionable information to fine tune a product to a particular desired level of Semantic experience. I have done a good bit of this sort of thing in my career including with acoustics. Specific acoustic requirements for a particular Semantic profile can be obtained by correlating various acoustic metrics (or expert judgements) with the Semantic scores for a sample of consumers. I have used similar methods to define requirements for visual and tactile qualities of products as well.
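And again purely for illustration, that correlation step might look like the sketch below; the acoustic metric, its values, and the mean Arousal scores are invented for the example.

```python
# Illustrative only: correlate one acoustic metric with a mean Semantic score
# across a set of products.
import numpy as np
from scipy.stats import pearsonr

# One value per product: a hypothetical high-frequency energy index and the
# mean Arousal score obtained from consumers.
hf_energy_index = np.array([0.2, 0.5, 0.7, 0.9, 1.1, 1.4])
mean_arousal    = np.array([3.1, 3.4, 4.0, 4.6, 4.9, 5.5])

r, p = pearsonr(hf_energy_index, mean_arousal)
print(f"r = {r:.2f}, p = {p:.3f}")
# A strong, reliable correlation marks this metric as a lever a designer could
# use to move the product toward a target Arousal level.
```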
@artemus_5 Science is the art of reducing the field of the unknown by making objective observations and measuring them. Religion is the art of camouflaging the field of the unknown with dogmatic certainties that are usually not corroborated by objective observation.
Sorry, but blind testing, if set up so that you can switch between two components or wires, may not be the "be all and end all" of what sounds better, but it is close, since you can immediately listen to each component and see if you can hear a difference. It also eliminates the listener's (or purchaser's) bias. So it is puzzling to me why this topic is controversial at all. Acoustic memory is notoriously bad, and so by the time I switch out cables or components, it is hard to accurately pinpoint all of the subtle differences. Having a proper switching set-up to allow immediate comparisons would have probably saved me a lot of money!
Thank you, dletch2, for your post. There is much subtlety buried in these questions of perception. For instance, what is the difference between preference and liking? Liking (the "good" pole of the bipolar Semantic dimension of Valence) is always a "good" thing (pun intended). But, what if two products are equally well liked (i.e., equally "good") AND different in other apparent semantic qualities such as strength, arousal, or novelty? Knowing which one is preferred helps pick one. But, unless you measure these other semantic dimensions you don't know why it's preferred. I come at this indirectly by having my subjects rate their imagined "ideal" product, which gives me target values on each semantic dimension to shoot for. But, I have to assume that my subjects are familiar enough with the products to know what "ideal" looks, sounds, feels, tastes, or smells like for them. In practice, this does not seem to have been a problem.
I will also add that my interest is not in knowing whether people can detect a difference between products, but where in the multidimensional perceptual space any given product lies. Products that are close to one another in perceptual space are more similar to one another than products that are farther apart. Here, the words of my graduate advisor, Dr. David Lane, come to mind: statistics can tell you whether two stimuli are reliably different, but they can't tell you whether that difference makes a difference to your target audience. There, you have to know something about preference.
As for fatigue, my subjects evaluate products one at a time. So, they give each product their full attention. But, I evaluate many products in the same study (typically 15 to more than 30), so there is a possible element of fatigue. I handle that with counterbalancing. Across subjects, each product appears equally often at the beginning, in the middle, and at the end of the presentation sequence.
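If it helps to picture the counterbalancing, here is a minimal sketch of a simple rotation scheme in which each product appears equally often in every serial position; the product labels are placeholders, and a full Latin-square (Williams) design would additionally balance carryover effects, which this simple rotation does not.

```python
# Minimal sketch: positional counterbalancing by rotating the presentation order.
def rotated_orders(products):
    """Return one presentation order per rotation of the product list."""
    n = len(products)
    return [[products[(start + i) % n] for i in range(n)] for start in range(n)]

orders = rotated_orders(["A", "B", "C", "D"])
for group, order in enumerate(orders, start=1):
    print(f"subject group {group}: {order}")
# Across the four orders, every product appears first, second, third, and last
# exactly once; assigning equal numbers of subjects to each order balances
# serial position (and thus fatigue) across products.
```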
Again, thanks for your comments, dletch2! They make me think.
@OP - I have not read the whole thread and apologize for any repetitive content. Terms such as "ABX testing" and "blind testing" are generalizations that do not allow us to assess the appropriateness of certain methods for the verification of claims or the existence of phenomena. As usual, the devil is in the detail, and we have to look at the specific design of a study and the underlying hypothesis before we can judge the quality and usefulness of a given study design and method. This is particularly true when we want to establish that a subjective preference is real or the result of bias.
If, for example, a person claims that a cable or a fuse makes a clear, audible difference in the sound of a system, an appropriate simple study design would be along the lines of: (1) this one person (2) on 12 consecutive days listens to (3) the same program on the same system in the same room, comparing the claimed superior component with the same standard component on each day. The test subject is 'blind' to the active component and is asked to identify which one is active. From this we would learn whether the audible difference indeed 'exists' for this person making that very claim. We would also learn something semi-quantifiable, i.e. whether it is marginal at best or "crystal clear". For a clear effect (often touted as dramatic or transforming), this study would be statistically powered with 12 data points. It would be objective.
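To make "statistically powered with 12 data points" concrete, here is a minimal sketch of how such a result could be scored, assuming a one-sided exact binomial test against 50% guessing; the number of correct identifications used below is purely hypothetical.

```python
# Minimal sketch: exact binomial test of a 12-trial single-listener blind test
# against chance (p = 0.5). The 10-correct figure is hypothetical.
from scipy.stats import binomtest

n_trials = 12    # one blind trial per day, as in the design described above
n_correct = 10   # hypothetical count of correct identifications

result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"{n_correct}/{n_trials} correct, one-sided p = {result.pvalue:.3f}")
# 10/12 correct gives p ≈ 0.019; 9/12 gives p ≈ 0.073. A listener who really
# hears a "crystal clear" difference should clear this bar comfortably.
```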
If, on the other hand we don't want to show the mere existence of a phenomenon ("I can hear a difference when measurements don't detect a difference."), yet we want to determine a preference which holds true for many people, we need more test subjects.
Yet, to articulate a preference when the existence of an audible difference between two components cannot be established in the first place (specifically, when the very person claiming the existence of a preference cannot reliably differentiate between two components) seems unreasonable or arbitrary.
Having said that, when components are described as "synergistic" without anyone being able to establish their discrete effects experimentally, let alone quantify them, the use of this term seems baseless. In order to establish synergy, one needs to be able to detect AND quantify. And when a company uses "Research" in its name, claims there are no methods to measure or test critical product performance parameters, and also does not have any appropriate blind listening data (see above), I am extremely skeptical. In fact, I am not interested.
@nonoise, I am not sure what "aha" you think @jakleiss has introduced to this topic. Nothing in what he wrote suggests that blind testing is a bad idea.
Is that some more of your remote viewing capabilities?
Testing is cool and all. But seriously, this is something that manufacturers should do as part of R&D.
I'm pretty sick of the garage audio scientists putzing around with their 75-dollar mics and charts claiming authority over the audio universe. I have a life. I spend my spare time listening to music and fiddling with my various hobbies. IDGAF about condescending zealots.
Interesting thread. Just my two cents, but don’t we all listen in order to choose? If I am listening to a couple of different things and trying to decide between them, I would rather not know which one I am hearing, in order to take away any bias--I think we all know it works, even if we don’t have graphs etc. to back up our findings.
In the end, isn’t it just about choosing what we like to listen to? Even if a person did an AB comparison or an ABX or name-your-comparison, isn’t that just what sounds good to them? Doesn’t mean I have to like it--maybe I like those $500 speakers instead of the $5000 ones; maybe it's the opposite. I have a friend who likes Marmite, and I think it tastes disgusting. Regardless of any testing, I am never eating that stuff again, even if he tells me over and over that I should like it.
The never ending quest to show that what you have or like is better than what another person likes or owns, or to tell another person that they are morons for purchasing x or y is nothing but pride. It just gets in the way of a community of listeners enjoying the hobby and turns the whole thing into an adversarial mess.
Setting up a proper blind test is very difficult and must address several issues, while other issues with the test can't be controlled. You must match volume precisely, and you must be able to switch very quickly (within seconds).
One of my friends was flown out to participate in one of the Harman tests involving, I believe, a new Infinity speaker. These new speakers were up against B&W 801s. Output levels were matched, with 3-to-5-second snippets of the same music played between the two speakers and preferences tallied by the listeners 20 times. I think he said it was all in mono, and he was the only person to pick the B&W every time.
When I asked him how it was he said that it was actually somewhat stressful and he thought the test flawed on numerous levels. First and foremost he thought that the upstream gear had a significant impact on the sound of the speakers and that this aspect only proved which system configuration sounded better at that moment.
So even when dletch2 does his informal blind tests, he only proves that in his system and at that moment his taste says that one component betters another.
Well, how can you fault a business for trying to manufacture a market? And it is different in that you can buy a product and decide for yourself, using whatever methods you choose, if the item is worth the asking price. A tangible product which you can audition and compare against other similarly priced products. It isn't even remotely the same thing, and your suggesting it is similar suggests that you are simply another casualty of this age.
I believe he was talking about the point of advertising--advertisers use our own nature (and pride) to sell their products, making us believe they are better, more luxurious, more this more that. Some products may be better, and I think my post above specifically pointed out that comparison of different models and coming away liking whatever it is you picked out is a good thing. What’s not good is when you buy it and then lord it over others (or when you don't and then berate others who did). Not sure why there is an issue with that.
Me again. I commented above on the psychometric technique of Semantic Differential rating scales, which is the primary technique I use in assessing perception because it is easy for participants to use and it differentiates products along multiple dimensions simultaneously. However, I have also used another psychometric technique, which is based upon similarity judgements. It strikes me that this touches upon the issue of AB and ABX testing. I collect similarity data either by having participants in my studies rate the degree of perceived similarity between pairs of items, or with a "triad" method in which items are presented three at a time and the participant selects the one that seems most different. Each triad provides input to three cells of a (dis)similarity matrix. Cells for the two items that weren't chosen as being "different" are coded "0" for minimum dissimilarity and cells for the two pairings involving the "different" item are coded "1" for maximum dissimilarity. I sum the matrices for individual participants and analyze the result using Multidimensional Scaling to map all items within a multidimensional perceptual space.
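For anyone curious, a rough, hypothetical sketch of that triad coding and the MDS step follows, using scikit-learn's MDS with a precomputed dissimilarity matrix; the item names and triad choices are invented, and real studies involve many more items and judgements.

```python
# Hypothetical sketch: code triad judgements into a dissimilarity matrix and
# map the items into a 2-D perceptual space with Multidimensional Scaling.
import numpy as np
from sklearn.manifold import MDS

items = ["amp_A", "amp_B", "amp_C", "amp_D"]
index = {name: i for i, name in enumerate(items)}

# Each judgement: (the three items shown, the one picked as "most different").
triads = [
    (("amp_A", "amp_B", "amp_C"), "amp_C"),
    (("amp_A", "amp_B", "amp_D"), "amp_D"),
    (("amp_B", "amp_C", "amp_D"), "amp_B"),
    (("amp_A", "amp_C", "amp_D"), "amp_A"),
]

dissimilarity = np.zeros((len(items), len(items)))
for triad, odd_one in triads:
    # The pair not involving the odd-one-out contributes 0 (similar); the two
    # pairings with the odd-one-out each get 1 (dissimilar).
    for other in (x for x in triad if x != odd_one):
        a, b = index[odd_one], index[other]
        dissimilarity[a, b] += 1
        dissimilarity[b, a] += 1

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)
for name, (x, y) in zip(items, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
```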
The rating method provides what is essentially a continuous metric for the AB comparison and the triad method provides something that is functionally similar to ABX except with three different items and the question being reversed, i.e., not which two are the same, but which one is most different? The two items remaining after the odd-man-out become the "same" items in ABX.
Although many psychometricians assume that the similarity judgments and Semantic Differential ratings produce equivalent results, I find that is usually not the case. In my experience, the Semantic Differentials (using Factor Analysis) differentiate a larger number of independent perceptual dimensions whereas the similarity comparisons (using Multidimensional Scaling) sometimes reveal higher order perceptual qualities that are not easily described with words. Either way, the emphasis is on identifying the number of different ways that things vary. Perception is always multidimensional unless you intentionally restrict the range of variation. Ironically, that is exactly what happens when you compare only two or a small number of items.
Well, after you buy something, what you choose to do is up to the individual, and I feel it is no more likely with an expensive product than a cheaper one. We see clues of this everywhere on this forum, with suggestions that a Magneplanar or horn speaker is better than the Wilson or other types. I see nothing wrong with advertising goals, because questioning proper etiquette removes individual responsibility from the experience. You might say that the person who succumbs to advertising and buys a very expensive speaker was manipulated, but I would never presume to know the person's motivation. Nor that they were inspired by some dletch2 bias, or by any reason other than that it sounded better. Why don't some of you admit that there is a huge moral piece to all of this in terms of levels of accepted consumption (see Wilson thread).
Seriously dletch2, you have to come up with something better than cults. You are wasting my time with these stretches. You are the one asking us to join the blind listening cult and telling us that without this method we are being duped. We are not asking you to change anything in your regimen; we just propose that it doesn't accomplish what you think in all cases, and we are not asking you to join our cult. Make no mistake, as per your definition, these are both cults apparently.
Yes it’s a very imperfect world out there but Audiogon members on threads are working hard to somehow make it better. Thank G-d they got their hifis and what they think they hear to help stay sane.
What they think they hear is the same as what they hear!
dletch2, you are in no position to make this judgement. This is my whole point. Just because you seem to find solace in double blind tests doesn't mean that our observations without them are any less viable. Those of us who have been around the block more than once are comfortable with our decisions, and your validation means nothing. At least not to me. Nothing helps critical listening more than experience and detailed comparisons between components, sighted or not.
So let's whip 'em out, place them on the table, and compare. Give me your background in terms of 2-channel and your current system specifics. Analog or digital, etc. Also list room treatments, dimensions, and any other relevant information like age. If you want to include degrees and your profession, that is fine.
Please stop trying to place all enthusiasts in the same small box simply for your peace of mind. And stop ascribing the same motivations and biases to all as if this were some cosmic constant.
So even when dletch2 does his informal blind tests, he only proves that in his system and at that moment his taste says that one component betters another.
Indeed, what has been proven? It seems that some here want to get away from all biases, as if that is even possible. Suppose the guy next to you or the operator gives an approving gesture or says "Wow!" every time he likes the sound? Then what bias kicks in? We are human, different and unique, not machines. But even machines often have bias.
Seriously dletch2, you have to come up with something better than cults. You are wasting my time with these stretches. You are the one asking us to join the blind listening cult and telling us that without this method we are being duped. We are not asking you to change anything in your regimen; we just propose that it doesn't accomplish what you think in all cases, and we are not asking you to join our cult. Make no mistake, as per your definition, these are both cults apparently.
I think he believes that it's just a matter of putting on a blindfold. From what I have read, there is a lot more to it, and it really is not worth the time considering the results.
Your comments are very enjoyable but I was wondering if most of these comparative tests are done blind, at least as much as possible. If not, how do you control for bias? Are there other techniques besides blind testing you could use for difference in sound?
Please look back at PS Audio's video near the top of the discussion. It is right on the money. Now, some things are not easy to compare, like certain tweaks, acoustic treatments, etc., and correcting the volume for different outputs of components being tested is not easy (or may be impossible as louder typically sounds better), and sometimes other factors come into play, like how good a deal you got, the dealer making it easy with a trade in credit or handling of the sale of your old gear, etc. This is typically not black and white, and certainly not life or death.
Bottom line, if you don't like what you bought, hopefully it has some fans and you can sell it here or on another site.
Let's stay away from exaggerated comments like "cults" that have negative (or are negative) connotations. Enjoy the hobby and the gear, whether you keep things for a long time or are a constant tinkerer with upgradeitis.
dletch2 ... Just because you seem to find solace in double blind tests doesn't mean that our observations without them are any less viable. Those of us who have been around the block more than once are comfortable with our decisions, and your validation means nothing. At least not to me. Nothing helps critical listening more than experience and detailed comparisons between components, sighted or not.
Exactly. And given that this is a hobbyist’s group and not a scientific forum, the nonsense and insults from the fundamentalist naysaying measurementalists here is really getting old and becoming an obstacle to conversation.
Exactly. And given that this is a hobbyist’s group and not a scientific forum, the nonsense and insults from the fundamentalist naysaying measurementalists here is really getting old and becoming an obstacle to conversation.
Which is their wont. They are, after all, zealots.
djones51, and others, I wouldn't say that blind testing is a test methodology, but, rather, an experimental design control for confounding variables. Another option is to include those variables in the analysis and test for their effects directly. I sometimes do this with demographic factors such as participants' age, gender, level of education, cultural identity and such. But, all of this aligns with the multivariate research methodologies I like to use, which correlate information across multiple products and participants. Comparing only two things necessarily confounds any potentially relevant variables.
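As a hedged sketch of what "include those variables in the analysis" might look like in code, the example below fits an ordinary least squares model with age and gender as covariates via statsmodels; the column names, formula, and data are made up purely for illustration.

```python
# Illustrative only: estimate a product effect while modeling demographic
# covariates directly instead of (or in addition to) blinding.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "rating":  [6.1, 5.4, 6.8, 4.9, 5.7, 6.3, 5.1, 6.0, 5.8, 6.4],
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "B", "A"],
    "age":     [34, 52, 41, 60, 28, 45, 55, 38, 47, 31],
    "gender":  ["m", "f", "f", "m", "f", "m", "m", "f", "f", "m"],
})

# The product coefficient is estimated with age and gender held in the model,
# so their influence appears as separate terms rather than as hidden bias.
model = smf.ols("rating ~ C(product) + age + C(gender)", data=data).fit()
print(model.params)
print(model.pvalues)
```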