Welcome to Statistics 101, and no snickering, please.
Even a not-all-that-close reading of the article will lead anyone familiar with experimental design to conclude that the results are without meaning. I'm sorry I didn't see this thread sooner, so that my comments could have been more current. The sections in quotes were cut and pasted from the article.
1. Reading the article from Secrets of Home Theater and High Fidelity, it is apparent that the procedure has many, many steps that might influence the results and that the two groups were run under different procedures. The non-comparability of the groups and the multiple steps mean that the overall procedure may be flawed. Is there proof? No, but there is equally no evidence that this is a valid procedure. Stop, start; unplug, replug (on multiple components, no less); power down, power up; warm-up; musical selection length determined how? Half of the participants attended a training session the month before. The second trial had musical selections that were longer than the first. Group one (no snickering here, please) had the felt down for the training and felt up for the listening. Group two were felt up both ways. (I said, NO SNICKERING!) Group one ate after the test and greeted Group two. Group two ate before the test with Group one. Lots of experimental manipulation, all of it quite apparent to the participants. What might this have done? Beyond that, the two groups were different from one another and differed internally by virtue of the pre-training. At a minimum, the data from the two should not have been aggregated.
Participants were 80% correct in their responses to the selection from the Berlioz Requiem. Manny calls this very close to the threshold between chance and perception. None of the other selections produced responses higher than 60%. This is consistent with John Atkinson's experience that his participants fared best on massed choral music. If any of us were mad enough to conduct another blind test of this nature, I would make audiophile recordings of massed choral music at least 50% of the musical selections. It would be interesting to discover whether it made a difference.
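For anyone who wants to see what "the threshold between chance and perception" actually amounts to, here is a back-of-the-envelope binomial check. The response counts are my own assumptions for illustration (the article's trial counts aren't quoted here); the point is only that whether 80% correct is distinguishable from guessing, while 60% is not, depends heavily on how many responses were collected.

```python
# Back-of-the-envelope check: how unlikely is 80% or 60% correct if the
# listeners were just guessing? The response counts (n) are assumptions;
# the article's actual trial counts may well differ.
from scipy.stats import binom

for n in (10, 20):                       # assumed responses per selection
    for label, rate in [("Berlioz Requiem", 0.80), ("other selections", 0.60)]:
        k = round(rate * n)              # correct responses at that hit rate
        # one-sided p-value: P(at least k correct | pure guessing, p = 0.5)
        p = binom.sf(k - 1, n, 0.5)
        print(f"n={n:2d}  {label}: {k}/{n} correct, p = {p:.3f}")
```

With 20 responses, 16/20 correct works out to roughly p = 0.006 while 12/20 is roughly p = 0.25; with only 10 responses, even 8/10 correct sits right at the conventional 0.05 line, which may be the "threshold" Manny has in mind.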
2. The procedure apparently produced at least one condition that had uniformly positive results, but Manny says it doesn't matter. How did Manny reach this conclusion?
"In post-test discussion, several of us noted that we had great difficulty remembering what A had sounded like by the time we got through with X. Several participants said that the way they dealt with this phenomenon was by ignoring A entirely and simply comparing B to X without giving thought to A."
3. Procedures may well have skewed the results.
"In many cases, statistically significant differences could be discerned by participants. In others, no differences could be discerned."
4. He does not make it clear how he determined this. Even earlier in the review, he noted:
"...that the very procedure of a blind listening test can conceal small but real subjective differences..."
5. Hmmm.....
"But, no, you have to take all the data together. You can't just pick out the numbers that suit your hypothesis. This would be statistically invalid. Same thing with just looking at one music selection."
6. But if the procedures have many elements that compromise their overall validity, this conclusion is unsubstantiated.
The fact that there was some evidence of statistically significant differences some of the time suggests that something may have been going on. Lumping all of the data together is not necessarily a good statistical procedure, particularly with so many manipulations going on, with the differences between the two groups, and with some individual participants having attended the pre-training. In any group, it could be that there is one individual who can detect differences, or one type of music that makes differences more detectable. No one is really sure what to make of statistical outliers (those many standard deviations from the mean), but citing group statistics does not address the issue, particularly with small groups.
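To see how pooling can bury a real effect, here is a toy simulation. Every number in it is invented for illustration (four selections, 20 responses each, one selection genuinely detectable 80% of the time); it is not a reanalysis of Manny's data.

```python
# Toy simulation: one genuinely detectable selection among several,
# analyzed per selection and then pooled. All numbers are invented;
# this is not a reanalysis of the article's data.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
n = 20                                   # assumed responses per selection
true_rates = {"massed choral": 0.80,     # one selection listeners really hear
              "selection B": 0.50,       # the rest are pure guessing
              "selection C": 0.50,
              "selection D": 0.50}

correct = {name: rng.binomial(n, p) for name, p in true_rates.items()}

# Per-selection one-sided binomial p-values against guessing
for name, k in correct.items():
    print(f"{name}: {k}/{n} correct, p = {binom.sf(k - 1, n, 0.5):.3f}")

# Pooled analysis: lump every response from every selection together
k_total, n_total = sum(correct.values()), n * len(true_rates)
print(f"pooled: {k_total}/{n_total} correct, "
      f"p = {binom.sf(k_total - 1, n_total, 0.5):.3f}")
```

Depending on the draw, the pooled result often fails to reach significance even while the one detectable selection clears it on its own. Manny is right that testing each selection separately calls for a multiple-comparisons correction, but pooling everything is not automatically "good science" either.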
"But, we can't do that and claim good science."
7. Calling it science does not make it science.
These procedures may or may not be flawed. What is clear, though, is that there is very little statistical power in such procedures -- two small groups and a large number of experimental conditions.
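To put a rough number on "very little statistical power," here is a sketch under assumed conditions: a listener who genuinely hears a difference 70% of the time, judged by a one-sided binomial test at alpha = 0.05. The 70% figure and the trial counts are illustrative assumptions, not numbers from the article.

```python
# Rough power calculation for a small forced-choice listening test.
# The assumed true hit rate (70%) and trial counts are illustrative only.
from scipy.stats import binom

alpha = 0.05
true_rate = 0.70        # suppose a real difference is heard 70% of the time

for n in (10, 20, 50, 100):
    # smallest number correct that would be called significant at alpha
    k_crit = next(k for k in range(n + 1)
                  if binom.sf(k - 1, n, 0.5) <= alpha)
    # power: probability such a listener actually clears that bar
    power = binom.sf(k_crit - 1, n, true_rate)
    print(f"n={n:3d}: need {k_crit}/{n} correct; power = {power:.2f}")
```

At 10 or 20 trials per condition, a difference that is genuinely audible 70% of the time would be missed more often than not.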
My conclusion is that nothing can be learned from this test as structured. Using statistical procedures to analyze poorly run experiments cannot redeem the experiments. Lots of experiments designed by far more accomplished folks are found to be flawed. This would never be published by anything other than an on-line audiophile publication.
Buy a better power cord and decide for yourself. Get it from a source that allows a trial period.
Rouvin