Why is Double Blind Testing Controversial?


I noticed that the concept of "double blind testing" of cables is a controversial topic. Why? A/B switching seems like the only definitive way of determining how one cable compares to another, or how any other component, speakers for example, compares to another. While A/B testing (and particularly double blind testing, where you don't know which cable is A or B) does not show the long term listenability of a cable or other component, it does show the specific and immediate differences between the two. It shows the differences, if any, how slight they are, how important they are, etc. It seems obvious that without knowing which cable you are listening to, you also eliminate bias and preconceived notions. So, why is this a controversial notion?
moto_man
I have a proposal ... double blind posting. Audiogon allows us to post our views with anonymity. Other Agoners then guess who posted which post, according to the content of the post, and the (extreme) opinions therein.
Suggestions for starter threads : "Power cords make no difference" and "SACD is killing digital".
Wellfed, in no way did I mean to imply that audiophiles are particularly susceptible to deception, either external or internal. It really does seem to be the case that humans - all of us - are wired for sensory "over detection"; nothing bad or good about it, that's just the way we are.

Don't underestimate the other attributes of audio components - things like build quality, reliability, corporate reputation, ergonomics, visual presentation/industrial design, price, etc. are all perfectly valid areas upon which to base and build preferences. Nothing bad about that, either. I strongly suspect that my Nordost ICs don't sound any better (or worse) than the overwhelming majority of alternatives (and I haven't heard any differences, either), but I enjoy the fact that they are technically one of the best out there.
This is a very important proposal and needs urgent consideration - and should not be treated flippantly, Sean. I for one am very concerned that readers of posts are likely to be biased against individual posters - for example, had I a prejudice against the Irish I may very well not have taken the last post as seriously as it deserves. The danger, if we do not have double blind posting - and I believe posts should be scattered randomly about the site, just to be sure - is that readers will not get the true meaning of what is posted here, as they will read with tinged spectacles.
Tinged spectacles. Are we getting carried away with this? What's next, white canes and seeing-eye dogs? At least the dogs might be able to confirm extreme frequency responses as well as finding the stuff that really stinks.
Did I hit a nerve? Or did I miss sarcasm in Redkiwi's response? It was a joke, guys ... let's laugh at ourselves occasionally. HiFi is a very unimportant topic in a world full of war, famine and death, and certainly not worth getting worked up about.
This is exactly why this topic is off limits at Audio Asylum. I think Audiogon should follow their excellent lead.
Huh? Redkiwi and Seandt were obviously joking. None of the recent posters seem upset. And, this topic is only off limits in the cable asylum. No rule against mentioning or discussing DBT's in the general or other specialized asylums.

Simple courtesy should be sufficient here. If someone asks, as in the initiation of this thread, "what's so bad about DBTs?", it should be obvious that he doesn't think anything is wrong with the subject, and if someone does, he might either ignore the thread or give his point of view without picking a fight.
Banning topics such as this is a very bad idea. Despite limited regressions into philosophy and politics Audiogon has consistently shown that intelligent and polite dialog regarding audio is possible.
To answer the original question, DBT is controversial because there are widely divergent views of its accuracy and applicability. One group of people feels that DBTs as a test methodology are inherently incapable of demonstrating audible differences. Another group feels the opposite. A fertile topic for discussion, in my opinion.

The rancor comes, unfortunately, when fringes on one side or the other feel the need to characterize those with whom they disagree as either "meter readers with no hearing/bad systems/no experience/etc." or as "delusional and indulging in wasteful fantasy". Neither is correct (well, not in most cases), nor productive to meaningful discussion of the subject at hand.

Why some feel that this particular topic - why the controversy over DBTs - is unsuitable for discussion mystifies me. Well, not really . . .
All this talk of DBT, could anyone provide a link to any such reputable, controlled testing done in audio, including the statistical manipulations used to obtain the conclusions, so we at least know what we are arguing here? Anyone care to philosophize on the great fallibility of science in general, or even specifically on the science and statistics involved in any such testing? Alan Chalmers, anyone?
Socrates: You've asked a mouthful of questions. I'd suggest you start out with this site: www.pcabx.com, where you can download software that will allow you to conduct your own DBTs.
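(If you're curious what such software is doing under the hood, the core logic of an ABX trial is nothing exotic. Here is a purely illustrative Python sketch of my own - not the pcabx code itself - that captures just the randomization and scoring; the actual switching and listening is up to you and your hardware.)

import random

def run_abx_session(n_trials=16):
    # "A" and "B" stand for the two cables/components under test. On each
    # trial, X is secretly assigned to one of them and the listener must say
    # which one it matches.
    correct = 0
    for trial in range(1, n_trials + 1):
        x = random.choice(["A", "B"])
        guess = input("Trial %d: does X match A or B? " % trial).strip().upper()
        if guess == x:
            correct += 1
    print("%d of %d correct" % (correct, n_trials))

# run_abx_session()  # uncomment to try it - you supply the listening part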

I don't want to get into statistics, except to say that's usually not the weak link in a DBT. As for the "fallibility of science," that's not the way I'd put it. I'd say that science is never finished, and it can always discover something new, or that something once "proven" to be right is in fact wrong. Science, in short, is the best explanation we have right now for whatever phenomenon we wish to explain. But you can't just wish it away. Current knowledge stands as knowledge--as fact--until somebody comes along with new knowledge that refutes it.

That said, anyone--and I mean anyone--who does serious research on either human hearing or sound reproduction uses DBTs--and ONLY DBTs. No one in the scientific community would think of doing a listening test any other way, because such tests are absolutely necessary to isolate and compare only the sound.
I think Hearhere summed up the issue well in his last post, but I would come at it from a slightly different angle. Simply put, DBT is not, in and of itself, "controversial." However, there is a great deal of misunderstanding/disagreement regarding its use and applicability. More particularly, DBT is simply a tool, the results of which are interpreted based on statistical analysis, and must be understood in that context. While DBT does have some applicability in the audio context, it is not the be-all and end-all that some make it out to be.

There are two main problems with how DBTs are used/viewed by certain audiophiles. First and foremost, what many do not understand (but what anyone with experience in statistics can tell you) is that if a result is not statistically significant, the DBT has not "proven" there are no differences between conditions! Rather, all that can be concluded is that the DBT failed to reject the null hypothesis in favor of the alternative hypothesis.

Second, small-trial (aka "small-N") listening tests analyzed at commonly used statistical significance levels (e.g. alpha = .05) lead to large Type 2 error risks, thereby masking the very differences the tests are supposed to reveal.

Now breaking that down into English is a pain, but I'll give it a shot (I'm an engineer, as opposed to a statistician - so any stats guys feel free to correct me). In a simple DBT, one attempts to determine if there are audible differences between two conditions (such as by inserting a new interconnect in a given system). This is more commonly called a hypothesis test - the goal is to determine whether you can reject a "null hypothesis" (there are in fact no differences between the two conditions) in favor of an "alternative hypothesis" (there are in fact differences between the two conditions).

In a DBT, there are four possible results: 1) there are differences and the listener correctly identifies that there are differences; 2) there are no differences and the listener correctly identifies there are no differences; 3) there are no differences, but the listener believes there are differences; and 4) there are differences, but the listener believes there are no differences. Obviously, 1 and 2 are correct results. Circumstance 3 (concluding that differences exist when in reality they don't) is commonly referred to as "Type 1 error". Circumstance 4 (missing a true difference) is commonly referred to as "Type 2 error". Put in terms of the hypothesis test stated above, Type 1 error occurs when the null hypothesis is true but wrongly rejected, and Type 2 error occurs when the null hypothesis is false but wrongly accepted.

Now, things get a little complicated. First we need to introduce a variable, p_u, which is the probability of success of the underlying process. In the listening context, this is the probability that a listener can identify a difference between conditions, which is based on the acuity of the listener, the magnitude of the differences, and the conditions of the trial (e.g. the quality of the components, recording, ambient noise, etc). Unfortunately, we can never “know” p_u, but can only make reasonable guesses at it.

We also need to introduce the variable "alpha". Alpha, or the significance level, is the level at which we can reject the null hypothesis in favor of the alternative hypothesis. By selecting a suitable significance level during the data analysis, you can select the risk of Type 1 error that you are willing to tolerate. A common significance level used in DBT testing is .05.

Finally, we need to look at the probability value. In hypothesis testing, the probability value is the probability of obtaining data as extreme or more extreme than the results achieved by the experiment, assuming the null hypothesis is true (put another way, it is the likelihood of an observed statistic occurring on the basis of the sampling distribution).

Once the DBT is performed, one compares the probability value to alpha to determine whether the result of the test is statistically significant, such that we can reject the null hypothesis. In our example, if the null hypothesis is rejected, we can conclude there are in fact audible differences between ICs.
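(An aside for the programmers out there: under the null hypothesis the listener is just guessing, so the number of correct answers follows a binomial distribution with p = 0.5, and the probability value is simply the upper tail of that distribution. A small Python sketch of my own - the function name is purely illustrative - shows the comparison to alpha:)

from math import comb

def p_value(correct, trials, p_null=0.5):
    # probability of getting at least 'correct' right answers by pure guessing
    return sum(comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
               for k in range(correct, trials + 1))

alpha = 0.05
p = p_value(12, 16)  # e.g., 12 correct out of 16 trials, as in the example below
print("reject the null hypothesis" if p < alpha else "cannot reject the null hypothesis", p)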

Now, here comes the fun part. It might seem that you want to set the smallest possible significance level to test the data, thereby producing the smallest possible risk of Type 1 error (i.e., set alpha to .01 as opposed to .05). However, this doesn’t work, because, as you reduce the risk of Type 1 error (lower alpha), the risk of Type 2 error necessarily increases.

Further, and a greater impediment to practical DBT testing, the risk of Type 2 error increases not only as you reduce the Type 1 error risk, but also with reductions in the number of trials (N) and in the listener's true ability to hear the differences under test. Since you never really know p_u, and can only speculate on how to increase it (e.g., by selecting only high quality recordings of unamplified music and using a high quality system to test the ICs), the best ways to reduce the risk of Type 2 error in a practical listening test are to increase either N or the risk of Type 1 error you are willing to accept.

Now for some examples. Let's assume we run 16 trials on the IC in question. For purposes of the example, further assume that the probability of randomly guessing correctly whether the new IC was inserted is 0.5. Finally, we must make a guess at p_u, which we could say is 0.7. In this instance, the minimum number of correct results for the probability value to fall below .05 is 12 (our Type 1 error risk in this case is 0.0384). However, our Type 2 error in this case goes through the roof - in this example, it is .5501, which is huge! Thus, this test suffers from a high level of Type 2 error, and is therefore unlikely to resolve differences that actually exist between the interconnects.

What happens if there were only 11 correct results? Our p value is then .1051, which exceeds alpha. Thus, we are not able to reject the null hypothesis in favor of the alternative hypothesis, since the p value is greater than alpha. However, this does not allow us to conclude that there are in fact no audible differences between ICs. In other words, data not sufficient to show convincingly that a difference between conditions is not zero do not prove that the difference is zero.
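(For anyone who would rather check these figures than take my word for them, they are just binomial tail probabilities, and a few lines of Python - again, my own sketch, nothing from any published study - reproduces them:)

from math import comb

def tail(correct, trials, p):
    # probability of at least 'correct' right answers out of 'trials' when the
    # true per-trial success probability is p
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

print(tail(12, 16, 0.5))      # Type 1 error risk at the 12-of-16 cutoff: ~.0384
print(1 - tail(12, 16, 0.7))  # Type 2 error risk, assuming p_u = 0.7: ~.5501
print(tail(11, 16, 0.5))      # p value with only 11 of 16 correct: ~.1051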

So now let's increase the number of trials to 50. Now, the number of correct results needed to yield a statistically significant result is 32 (p value = .0325). Assuming again that p_u is 70%, our Type 2 error drops to ~0.14, which is more acceptable, and thus differences between conditions are more likely to be revealed by the test.

OK, one last variation. Let's assume that the differences are really minor, or that we are using a boom box to test the interconnects, such that p_u is only 60%. What happens to Type 2 error? It goes up - in the 50-trial example above, it goes from .1406 to .6644 - again, the test likely masks any true difference between the ICs.
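(The same arithmetic reproduces the 50-trial figures; the tail() helper here is the same one as in the sketch above:)

from math import comb

def tail(correct, trials, p):
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

print(tail(32, 50, 0.5))      # p value with 32 of 50 correct: ~.0325
print(1 - tail(32, 50, 0.7))  # Type 2 error risk if p_u = 0.7: ~.1406
print(1 - tail(32, 50, 0.6))  # Type 2 error risk if p_u = 0.6: ~.6644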

To sum up, DBT is a tool that can be very useful in the audio context if used and understood correctly. Indeed, this is where I take issue with Bomarc, when he says "I don't want to get into statistics, except to say that's usually not the weak link in a DBT". Rather, the (mis)understanding of statistics is precisely the weak link in the applicability of DBTs.
Thanks, Rzado, for the refresher course. Let me try to summarize for anyone who fell asleep in class. In a DBT, if you get a statistically significant result (at least 12 correct out of 16 in one of Rzado's examples), you can safely conclude that you heard a difference between the two sounds you were comparing. If you don't score that high, however, you can't be sure whether you heard a difference or not. And the fewer trials you do, the more uncertain you should be.

This doesn't mean that DBTs are hopelessly inconclusive, however. Some, especially those that use a panel of subjects, involve a much higher number of trials. Also, there's nothing to stop anyone who gets an inconclusive result from conducting the test again. This can get statistically messy, because the tests aren't independent, and if you repeat the test often enough you're liable to get a significant result through dumb luck. But if you keep getting inconclusive results, the probability that you're missing something audible goes way down.
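(To put a rough number on the "dumb luck" problem - my own back-of-envelope figuring, and it assumes the repeated tests are independent, which, as I said, they really aren't:)

alpha = 0.05
for repeats in (1, 5, 10, 20):
    # chance that at least one "significant" result turns up purely by chance
    print(repeats, round(1 - (1 - alpha) ** repeats, 3))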

To summarize, a single DBT can prove that a difference is audible. A thousand DBTs can't prove that it's inaudible--but the inference is pretty strong.

As for my statement about statistics not being the weak link, I meant that there are numerous ways to do a DBT poorly. There are also numerous ways to misinterpret statistics, in this or any other field. Most of the published results that I am familiar with handle the statistics properly, however.
Good post, Bomarc - I agree with 98% of what you had to say. I guess the one thing I'm not sure about is the point you are making with respect to multiple inconclusive tests leading to a strong inference that a difference is inaudible. If you have multiple tests with high Type 2 error (e.g. Beta ~.4-.7), I do not believe this is accurate. However, if you have multiple tests where you take steps to minimize Type 2 error (high-N trials), I can see where you are going. But you are correct, that can start getting messy.

Thanks for clarifying your point about statistics, though. In general, I tend to give experimenters the benefit of the doubt with respect to setting up the DBT, unless I have a specific problem with the setup. But I agree, there are numerous ways to screw it up.

However, the few studies in high-end audio with which I am familiar (e.g. the ones done by Audio magazine back in the 80's) in general suffered from the problems outlined above (small N leading to high Type 2 error, erroneous conclusions based on non-rejection of the null hypothesis due to tests not achieving a p value < .05). There have been a couple of AES studies with which I'm familiar where the setup was such that p_u was probably no better than chance - in that circumstance, you can say either the setup is screwed up or the interpretation of the statistics is screwed up. At least one or two studies, though, were pretty demonstrative (e.g. the test of the Genesis Digital Lens, which resulted in 124 out of 124 correct identifications).

My biggest beef with DBT in audio is that you just need to do the work - i.e. use high-N trials - which is a lot easier said than done.
Rzado: My point on retesting is this: If something really is audible, sooner or later somebody is going to hear it, and get a significant response, for the same reason that sooner or later, somebody is going to flip heads instead of tails. If you keep getting tails, eventually you start to suspect that maybe this coin doesn't have a heads. Similarly, if you keep getting non-significant results in a DBT, it becomes reasonable to infer that you probably (and we can only say probably) can't hear a difference.

As for published studies, the ones I've seen (which may not be the same ones you've seen) generally did get the statistics right. What usually happens is that readers misinterpret those studies--and both sides of The Great Debate have been guilty of that.