Reviews with all double blind testing?


In the July 2005 issue of Stereophile, John Atkinson discusses his debate with Arnold Krueger, who, Atkinson suggests, fundamentally wants only double-blind testing of all products in the name of science. Atkinson goes on to discuss his early advocacy of such methodology and his realization that the conclusion that all amps sound the same, the result of such testing, proved incorrect in the long run. Atkinson’s double-blind test involved listening to three amps, so it apparently was not the typical same/different comparison advocated by those promoting blind testing.

I have been party to three blind tests and several “shootouts,” which were not blind and thus resulted in each component having advocates, as everyone knew which was playing. None of these ever resulted in a consensus. Two of the three blind tests were same/different comparisons; neither resulted in a conclusion that people could consistently hear a difference. The third was a comparison of about six preamps. Here there was a substantial consensus that the Bozak preamp surpassed more expensive preamps, with many designers of those preamps involved in the listening. In both cases there were individuals at odds with the overall conclusion, and in no case were those involved a random sample. In all cases there were no more than 25 people involved.

I have never heard of an instance where “same versus different” methodology concluded that there was a difference, but apparently comparisons of multiple amps, preamps, etc. can result in one being generally preferred. I suspect, however, that those advocating DBT mean only “same versus different” methodology. Do the advocates of DBT really expect that the outcome will always be that people can hear no difference? If so, is it that conclusion which underlies their advocacy, rather than the supposedly scientific basis for DBT? Some advocates claim that were there a DBT that found people capable of hearing a difference, they would no longer be critical, but is this sincere?

Atkinson puts it this way: the double-blind test advocates want to be right rather than happy, while their opponents would rather be happy than right.

Tests of statistical significance also get involved here, as some people can hear a difference; but if they are insufficient in number to achieve statistical significance, then proponents say we must accept the null hypothesis that there is no audible difference. This is invalid, as the samples are never random and seldom, if ever, of substantial size. Since the tests assume random samples, and statistical significance is greatly enhanced by large samples, nothing in the typical DBT works to yield the result that people can hear a difference. This suggests that the conclusion, and not the methodology or a commitment to “science,” is the real purpose.
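The arithmetic behind this point can be sketched quickly (the trial counts below are hypothetical, not from any test described in this thread): a listener who is right 70% of the time looks like chance in a short same/different run, while the identical hit rate over many trials is decisive.

```python
from math import comb

def p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: the chance of at least `correct`
    right answers in `trials` same/different trials by pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

# 7 of 10 correct: p = 0.172, so the null ("no audible difference") survives.
print(round(p_value(7, 10), 3))
# 70 of 100 correct: the same 70% hit rate is now overwhelming evidence.
print(p_value(70, 100) < 0.001)    # True
```

Note how the verdict flips purely as a function of sample size, which is the point being made: small panels are tilted toward the null result.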

Without DBT, the advocates suggest, those who hear a difference are deluding themselves: the placebo effect. But were we to use a double-blind protocol other than the same/different technique, and people consistently chose the same component, would we not conclude that they are not delusional? This would test another hypothesis: that some can hear better.

I am probably like most subjectivists, as I really do not care what the outcomes of db testing might be. I buy components that I can afford and that satisfy my ears as realistic. Certainly some products satisfy the ears of more people, and sometimes these are not the positively reviewed or heavily advertised products. Again it strikes me, at least, that this should not happen in the world that the objectivists see. They see the world as full of greedy charlatans who use advertising to sell expensive items which are no better than much cheaper ones.

Since my occupation is as a professor and scientist, some among the advocates of double-blind testing might question my commitment to science. My experience with same/different double-blind experiments suggests to me a flawed methodology. A double-blind multiple-component design, especially with a hypothesis that some people are better able to hear a difference, would be more pleasing to me, but even then I do not think anyone would buy on the basis of such experiments.

To use Atkinson’s phrase, I am generally happy and don’t care if the objectivists think I am right. I suspect they have to have all of us say they are right before they can be happy. Well tough luck, guys. I cannot imagine anything more boring than consistent findings of no difference among wires and components, when I know that to be untrue. Oh, and I have ordered additional Intelligent Chips. My, I am a delusional fool!
tbg
Therefore, participants should be able to demonstrate their critical listening skills.

Once again, the scientists are ahead of you. Standards for appropriate listener training exist. And they weren't devised based on the misapplication of principles from visual perception, let alone high-end cant; they were developed through experience that identified the background necessary to produce reliable results, both positive and negative.

If anyone doesn't feel those standards are sufficiently high, there has always been an alternative: Propose higher standards, and then find some audible difference that can't be heard without the benefit of your more rigorous training. For all the griping about DBTs, I don't see anybody anywhere doing that.

Finally, recalling the original subject of this thread, has any audio reviewer ever demonstrated that he possesses "critical listening skills" in a scientifically rigorous way? Nope. In fact, there's at least a little data suggesting that audio reviewers are *less* effective listeners than, say, audio dealers. This isn't too surprising. A dealer who carries equipment that sounds bad will go out of business. If a reviewer recommends something that sounds bad, he just moves on to the next review.

Pabelson - The DBT/ABX does not measure what sounds good or bad; it only measures whether the listener can identify A or B when compared to X.

It's not that we think ABX is not good enough; we think it's irrelevant.

It's been my experience that dealers are more of a slave to the audio press than audiophiles have ever been.

Finally, reviewers are not scientists; they are critics who give opinions as a guide. If audiophiles are taking their opinions as gospel, they should trust their own ears instead.
And by the way, I am deeply troubled by your use of the terms "sounding good" and "sounding bad." Those are decidedly subjective terms.
Just what scientific test did you use to come to that conclusion?
I think that DBT is a way of validating (or invalidating) "decidedly subjective" opinions about sonic quality. If you can't reliably tell which is which, your opinion about which is better must be taken with a big grain of salt.
Gregadd: There are lots of different DBTs. Some measure preferences, some measure *how* different two things are, etc. I'd say a reviewer should be allowed to use whichever method he likes (or invent his own, as long as it's level-matched and blind). But if he can't tell the difference between his own amp and the one he's reviewing under those very generous conditions, I think his readers ought to know that. Don't you?

And what's your problem with saying that equipment sounds good or bad? This is an audio discussion site. Eighty percent of the conversations here are about that. As for scientific tests of good and bad sound, that's what Sean Olive at Harman does for a living. Try Googling him.
No, I don't think they should take DBTs, because even when reviewers pass the test, so-called objectivists don't accept it.
So readers should be kept in the dark about the listening abilities of the reviewer? Whose interest does that serve?

Pabelson - No, they should not be kept in the dark about the reviewer's ability.
You assume that ABX/DBT is the standard by which everyone must be measured. I categorically reject that premise! Therefore I cannot answer any questions that require me to accept that premise.

As a long-time reader of audio component reviews, I am aware of their shortcomings. However, the overwhelming majority of reviewers admit that they are fallible and that they have listening biases. Their review is their personal opinion. Their goal is to identify components which they believe bring us closest to the reproduction of music.

Let's take The Absolute Sound, for example, when Harry Pearson still owned it. They periodically published their backgrounds, room dimensions, personal listening biases, and associated equipment.
Once they have identified an audio component as having merit, it is then the reader's job to make his own evaluation. This is true for any critic. If the critic is consistently wrong, ultimately the readers will go somewhere else for opinions.

The way I evaluate a reviewer's ability is to listen to components they have recommended to see if their opinion is valid. If I can't duplicate their experience on a consistent basis, then I have to doubt their ability.
Pabelson, I doubt any reviewer could "pass" the DBT, because of the methodology. Any substitution of methods other than same/different would likely result in rejection of the results by DBT proponents, as Gregadd says. Subjectivists would no doubt ignore reviewers "failing" the test. Nothing would be proven to anyone's satisfaction by the entire effort, so what is the point?

Somehow you seem to believe that reviewers are the arbiters of quality, leading customers around like sheep. As is often noted, the "best component" issues outsell other issues. I do not know whether this "proves" the influence of reviewers or magazines; some buyers may just be keeping count of where their equipment falls.
Testing the listening abilities of a reviewer... So what are you saying, Pabelson: that at the beginning of each review there should be a summary of a DBT by the reviewer of the particular equipment being tested, or of his listening abilities in general? The latter doesn't seem to work, as a DBT is by nature specific, not broad. So if the former, would you then rule out reading the rest of the review, perhaps apart from the factual description? Really? I guess you would. Personally, while the additional information might be interesting, it does not add that much for me. The reason, again, is that a DBT applies to a specific time and environment, namely the rest of the control variables: the rest of the system, which the reviewer may not be choosing. Or should he? If the rest of the system is constructed with equipment unfamiliar to the reviewer, what is he listening to? You may say he only needs to demonstrate that one piece of equipment was changed, but in such an unfamiliar setting our ears could gravitate toward and focus on something completely different.
Henry: Yes, of course, listening tests are only relevant for the specific gear you're listening to. But a review is about specific equipment. Think of it this way: A reviewer has a reference system. He gets a new amp for review. Can he tell whether his original amp or the review amp is in his system, without looking? If not, is there any value at all to what he says about the sound of the review amp?
I know of no one who requires anyone to take a mini-test to establish their credibility before they perform the real test. That would be tantamount to my being required to succeed in a little mini-trial before I do the real thing.
This would be equivalent to asking your doctor to examine two patients whose illnesses were known in advance before he could examine you. It would not be practical, and it would not prove anything.
Furthermore, one person passing a test does not prove anything. The sample group would have to be of sufficient size; one person's success or failure could easily be discounted statistically. Thus, if, say, Harry Pearson took the test and scored a perfect score, his results could easily be invalidated: the majority of the other reviewers could flunk, or have a statistically insignificant number of successes, which would mean his success was a statistical aberration. That sort of heads-I-win, tails-you-lose logic does not work.
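The pooling arithmetic behind this complaint can be made concrete (the 16-trial runs and the ten-reviewer panel are illustrative assumptions of mine, not figures from the thread): one reviewer's perfect run is individually very unlikely by guessing, yet pooling it with chance-level colleagues washes it out.

```python
from math import comb

def p_chance(correct: int, trials: int) -> float:
    """Probability of at least `correct` hits in `trials` guesses (p = 0.5 each)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

# One reviewer scores a perfect 16/16: taken alone, p = 1/65536.
print(p_chance(16, 16))
# Pool him with nine colleagues who each score 8/16 (pure chance):
# 16 + 9*8 = 88 hits in 160 trials, and the group result is unremarkable.
print(p_chance(88, 160) > 0.05)    # True: above the usual 0.05 cutoff
```

Whether the individual run or the pooled total is the right unit of analysis is exactly what the two sides here dispute.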
This time I mean it. If you listen, the results should be obvious; if not, don't buy. I have nothing else to say.
In any A/B comparison, the two compared signal-path elements (components, cables, tubes, etc.) must each have at least a week's trial of listening experience before being switched to the other, preferably over 4 or more such switches.

More immediate AB comparison is not sufficient to reveal subtle but significant differences.

This is because of the nature of auditory perception and its dependence on memory accrued over time (days or weeks, and not hours).

Changes attributed to burn-in are nearly always the result of becoming familiar with (accruing memory of) a component over days or weeks; your perception is what changes, not the component.

Immediate A/B testing can be downright invalid, and is only useful for detecting large and obvious sonic differences.
This is because of the nature of auditory perception and its dependence on memory accrued over time (days or weeks, and not hours).

This is just 180 degrees opposite of the truth. Auditory memory for subtle differences dissipates in a matter of seconds. I defy you to cite a single shred of scientific evidence to the contrary.
I’ve gone over this debate and would like to summarize many of the points made.

As to DBT there may be:
1. Problems with methodology, per se, in audio;
2. Problems with most DBT that has been done, e.g., lack of random assignment or no control groups, making these experiments invalid scientifically;
3. Problems with particular experimental designs that are unable to yield meaningful results;
4. Sample problems, such as insufficient sample size or non-random samples;
5. Statistical problems making interpretation of results questionable.

All of these problems interact, making the results of most DBT’s in audio scientifically meaningless.

Advocates of DBT have been especially vociferous in this forum, but what have they actually said to respond to these criticisms? Virtually nothing beyond "No!" or "Where’s your proof?"

The "proof" of their position cited has been interesting, but it has been a reporting on the power of "sham" procedures or other stories that do not meet the guidelines necessary for a DBT procedure to qualify as science.

At the same time, they call DBT science and maintain the supremacy of science. Calling something science without strictly adhering to scientific procedures, unfortunately, is not science, and this is the case with DBT in audio far more often than not. At this point it is more akin to the claim that intelligent design is science than to science itself. An additional point made in this forum has been the large number of DBTs that have failed to demonstrate that differences can be heard. But a large number of scientifically compromised procedures yields no generalizable conclusions.

For anyone who has worked at a major university research mill, as I have, the skepticism about research results is strong. It is not that there is an anti-research or anti-science attitude. Rather, it is a recognition that the proliferation of research is more driven by the necessity of publishing to receive tenure and/or the potential for funding, increasingly from commercial interests that have compromised the whole process. We will have to see what happens to scientific DBT in audio when and if it happens.

I conclude that we are speaking fundamentally different languages when advocates of subjective audio evaluation and DBT advocates speak. For my part, subjective evaluation is fine as long as I understand that I better think twice before I believe a reviewer. I also truly believe in the supremacy of science, and intelligent design is not science.
Rouvin, I substantially agree, of course. I agree moreover about the liabilities of publish or perish in academia and its effect on research, even though I am in a field with no commercial interests, other than public polling.

I do study public policy also, including the impact of creationism or intelligent design as it is now called. It is awkward to get good state data on science degrees issued before and after adoption of anti-evolution policies, but the worst states in terms of failing to teach evolution have not experienced a decline in science degrees. They never had many in Kansas, for example. It is much like abortion restrictions, the states that adopt such restrictions are those with few abortions and experience no decline thereafter. Where abortion is common, no politician would risk introducing a restriction or voting for one.

I too have been struck by why those advocating DBT seem to think anyone need bother paying attention to the results when buyers obviously hear a difference that causes them to buy. Anyone who trusts reviewers for anything more than suggesting what you might want to give a listen is bound to be disappointed.
For a guy who doesn't believe in intelligent design, Rouvin, you practice its methods to perfection. You offer no evidence of your own--no tests, no results, nothing that can be replicated or disproved. Instead, you quibble with the "methodology," which you seem substantially uninformed about ("e.g., lack of random assignment or no control groups, making these experiments invalid scientifically"--Why in the world would you need a control group in a perception test?)

We are speaking different languages, Rouvin. DBT advocates are speaking the language of science. You are not.
Sadly, anyone who views looking at methodological issues as quibbling does not understand that methodology is at the heart of science. Boring, tedious, surely, but a necessary condition to call something science. Your question, "Why in the world would you need a control group in a perception test?" reveals that you need to brush up on your basic science.
Every scientific field has its own methodology, Rouvin. If you had made an effort to acquaint yourself with the rudiments of perceptual psychology, you'd be in a better position to pontificate on it.

By the way, methodology is NOT at the heart of science. Empiricism is. Methodology is just a means to an end. Empiricism demands reliable, repeatable evidence. You still haven't got any.
Pabelson, data is the heart of science. To gather it, one has to operationalize the concepts in one's hypothesis, which involves methodology. Your distinction is not meaningful.

You are always justifying DBTs as often used in perceptual psychology. Such appeals are unscientific appeals to authority. There are many reasons to believe that as applied to audio gear, this methodology does not validly assess the hypothesis that some components sound better.

You, sir, also have no evidence that is intersubjectively transmissible. Furthermore, as I have said repeatedly, I would not care anyway. I buy what I like and need not prove anything to you or others wrapping themselves in the notion that they are the scientists and those who take exception to them are unscientific.
Pabelson, I think the "you have no proof" counter-argument falls flat when you have not provided any yourself. It's one thing to have some DBT results, but what Rouvin is pointing out is that such tests in isolation tell us nothing without a substantially larger sample size, statistical significance testing, and so on, which is what the scientific method, or what you call empiricism, requires; a few isolated DBTs cannot really be considered empirical evidence. Ergo, asking reviewers to be subjected to such a test and then provide their normal reviews is equally misleading.
data is the heart of science.

And this thread, now at over 170 posts, still doesn't contain a shred of reliable, replicable data demonstrating audible differences between components that can't be heard in standard DBTs.

There are many reasons to believe that as applied to audio gear, this methodology does not validly assess the hypothesis that some components sound better.

Name one. Check that. Name one that won't get you laughed out of the Psych Dept. faculty lounge.

You, sir, also have no evidence that is intersubjectively transmissible.

I don't need "evidence that is intersubjectively transmissible," because I'm not changing the subject. The subject is hearing, and what humans can and cannot hear. In order to argue that DBTs can't be used for differences in audio gear, you have to claim that human hearing works differently when listening to audio gear than it does when listening to anything else. That's about as pseudoscientific as it gets.
What is laughable in the Psych. Dept. is not authoritative. Data have to be presented to justify that DBT validly assesses sound differences among components. DBT lacks face validity, as most people can hear differences. You, sir, are the one guilty of scientific error, no matter how much you protest that others are pseudoscientific.

But more fundamentally, we are not engaged in science when picking wine, cars, clothing, houses, wives, or audio equipment, so Charlie is right. Put this to bed. Neither of us is convincing the other, nor ever will.
Henry: Go back and re-read the thread. I provided a link to a whole list of articles on DBTs, including tests of cables, amps, CD players, tweaks, etc. That's why I'm on solid ground in demanding that the opponents of DBTs do the same. As of yet, no one has come up with a single experiment anywhere disputing what those tests have shown. Not one.

Science isn't done by arguing about methodology in the abstract. It's done by developing a better methodology and producing more robust results with it. People like Rouvin wouldn't even know how to do that. And the people who do know how to do that aren't doing it, because they have better uses of their time than re-proving something that's been settled science for decades. If you think it isn't settled, then it's up to you to come up with some evidence that unsettles it.
TBG: This thread had been put to bed. It was dormant for two weeks. I'm not the one who revived it. And if it doesn't matter to you, why do you keep posting? Go buy your expensive cables, and just enjoy the pleasure they give you. As you said, what does it matter to you if scientists say your cables are indistinguishable from zipcord?
Pabelson, I do wish it would die, but you continue to misrepresent what science is and who best represents it. There is no evidence anywhere, including from your sacred DB testing, that demonstrates that cables don't sound different. You are not the authority who can declare that science has proven something for decades. No finding is ever proven; rather, it is tentatively accepted unless further data or studies using different methodologies suggest an alternative hypothesis. Robustness also is not of much use except to suggest that replications have often been done.

It is just the case that I will not concede that scientists or anyone has shown my better sounding cables are indistinguishable from zip-cord. Any fool's testing would indicate that is untrue, even if only in sighted comparisons.
Who's misrepresenting what, TBG? I never said cables can't sound different. I cited an article earlier that did 6 cable comparisons, and 5 of them turned out positive. I've corrected your misstatements about this previously. Please don't repeat them again.

Just for the record, what DBTs actually demonstrate is that cables are audibly distinguishable only when there are substantial differences in their RLC values. For most cables in most systems, that is rarely the case. Exceptions may include tube amps with weird output impedances, speakers with very difficult impedance curves, and yards and yards of small-gauge cable.

No finding is ever proven; rather, it is tentatively accepted unless further data or studies using different methodologies suggest an alternative hypothesis.

Exactly. So where's your data? Where are your studies?

Any fool's testing would indicate that is untrue, even if only in sighted comparisons.

The faculty lounge is most amused.
TBG, I submit that your 'better sounding' cables are distinguishable from simple zip cord precisely because their frequency response variation is large enough that one can hear the difference in a DBT. A zip cord of sufficient gauge (low resistance) to pass the current necessary to run the speakers, without affecting the damping, will be much more linear than your 'better sounding' cables. A DBT will distinguish between sufficiently different-sounding cables, but it will also show when there is no significant difference.
With respect, Bob P.
You said,"As you said, what does it matter to you if scientists say your cables are indistinguishable from zipcord?" I would take that to mean that you meant this.

I merely would state that I and many others reject that DBT validly assesses sonic difference among cables, etc. Where is your demonstration of face validity or any demonstration of validity?

My faculty lounge was also amused that I had any confidence in experiments, which they view as neither isomorphic nor generalizable to real life. They are always on my case for approving Psych. proposals that rely on forced contributions from students taking Psych. courses. They are enamored with econometric modeling, usually assuming that humans are rational. I have never found that humans maximize much, other than perhaps taking the lazy way out, such as voting for the political party they adopted from their parents.
I merely would state that I and many others reject that DBT validly assesses sonic difference among cables, etc. Where is your demonstration of face validity or any demonstration of validity?

Where to begin? First, we can physically measure the smallest stimulus that can excite the auditory nerve and send a signal to the brain. It turns out that subjects in DBTs can distinguish sounds of approximately the same magnitude. This shows that DBTs are sensitive enough to detect the softest sounds and smallest differences the ear can detect.

To look at it another way, basic physics tells us what effect a cable can have on the signal passing through it, and therefore on the sound that emerges from our speakers. And basic psychoacoustics tells us how large any differences must be before they are audible. DBTs of cables match this basic science quite closely. When the measurable differences between cables are great enough to produce audible differences in frequency response or overall level, the cables are distinguishable in DBTs. When the measurable differences are not so great, the DBTs do not produce positive results.
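A rough sketch of that basic physics, modeling the cable as a series resistance feeding a simplified resistive speaker load (the wire gauges, lengths, and the resistive-load simplification are my assumptions, not from the thread; real speakers have complex impedance):

```python
from math import log10

def insertion_loss_db(cable_ohms: float, speaker_ohms: float = 8.0) -> float:
    """Level drop (dB) across a cable treated as a series resistance
    driving a purely resistive speaker load."""
    return 20 * log10(speaker_ohms / (speaker_ohms + cable_ohms))

# 10 ft of 12-gauge zip cord (~0.03 ohm round trip): roughly -0.03 dB,
# far below any plausible audibility threshold.
print(round(insertion_loss_db(0.03), 3))
# 30 ft of thin 24-gauge wire (~1.5 ohm round trip): roughly -1.5 dB,
# a level shift large enough to show up in level-matched listening.
print(round(insertion_loss_db(1.5), 3))
```

This matches the pattern described above: only when the cable's series impedance is a nontrivial fraction of the load impedance does the resulting level difference approach audibility.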

That's how validation is done--we check the results of one test by comparing it to knowledge determined in other ways. DBTs of audio components came late to the party. All they really did was to confirm things that scientists already knew.
As I have said before too many times, were DBTs used that were not same/different tasks, and were they to show no differences, many who view this as science would be inclined to accept it as a valid measure of sounding different and better. Same/different questions over brief periods do not give results that have face validity.

Again, this discussion should be laid to rest. Your evidence and appeals to "what scientists already knew" authority are not the way to make your conclusions broadly accepted. Again, were this a matter of what would cure cancer, etc., there probably would be a need to resolve what an appropriate test is, but it is not. As such it is not relevant to discussions on Audiogon or AudioAsylum.
Your evidence and appeals to "what scientists already knew" authority are not the way to make your conclusions broadly accepted.

Rest assured, I have no illusions about the possibility of convincing someone who, despite a complete lack of knowledge about the field, nonetheless feels qualified to assert that a test methodology used by leading experts in the field for decades lacks "face validity."

I'm just demonstrating, to anyone who might be reading this with an open mind, that the people who carp about DBTs in audio threads have neither an understanding of the issue nor a shred of real tangible data to support their beliefs.
For someone who claims to be knowledgeable about research methods, you seem woefully insensitive to the need for your measures to validly assess the theoretical concept they are supposed to measure. Instead, you make very unscientific appeals to authority, which is perhaps the worst scientific infraction.

You have demonstrated that there are insufficient tangible data to dismiss the criticism that DBT is inapplicable to questions of what sounds best that could be shared among customers. Until the obvious disparity between what people hear and what DBT shows is resolved, no one is going to make buying decisions based on DBT. Perhaps you do, but I doubt it.

I am off to CES, so I will not be monitoring further useless appeals to authority.