I applaud this test. There is no way to make it perfect but you are doing an admirable job of it.
On the off chance that there might be some differences (though I don't see how there could be), you should do amplifier output level matching (electrically) between the outlets. If there are differences, you will want to find a way to compensate for them.
I think you should also try the test with several different audio components, e.g., two different source components, two different amps, etc. Or at least with an amp and a source component. (Maybe you already discussed this.) You will need to do output level matching again, or at least make sure that variations in loudness do not become a variable in the listening evaluation.
I think you already have too many outlets in the test for good experimental design. When we test a lot of concepts in online surveys, we use multiple respondent samples and expose each sample to a subset of the concepts, because respondents lose interest and the ability to discriminate as the number of tests goes up. In your case, testing with multiple components just adds to the complexity. So you need to think about breaking the test into smaller chunks, such as comparing and scoring two or three outlets at a time, then throwing that "winner" in against one or two new ones.
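To make the chunked, winner-advances idea concrete, here is a minimal sketch in Python. This is purely illustrative: `score_fn` stands in for whatever aggregate rating your listening panel produces for each outlet in a session, and the made-up scores at the bottom are not real data.

```python
def run_tournament(outlets, score_fn, group_size=3):
    """Compare outlets in small groups; the winner of each session
    advances against the next one or two fresh outlets, so no single
    session asks the panel to discriminate among too many devices."""
    pool = list(outlets)
    champion = None
    while pool:
        # Draw the next small group, leaving room for the reigning winner.
        take = group_size - (1 if champion else 0)
        group = pool[:take]
        pool = pool[take:]
        if champion:
            group.append(champion)
        # Score each outlet in this session and keep the best.
        champion = max(group, key=score_fn)
    return champion

# Illustrative scores standing in for panel ratings (not real results).
scores = {"A": 3.1, "B": 4.2, "C": 2.8, "D": 4.5, "E": 3.9}
winner = run_tournament(list(scores), scores.get, group_size=3)
print(winner)  # "D" under these made-up scores
```

The point of the structure is that each listening session involves at most `group_size` outlets, which matches the survey-research practice of exposing each sample to only a subset of concepts.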
There is also the phenomenon of "order bias." If you test the same outlets with different listening panels, change the order in which the outlets are heard.
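One standard way to rotate the order across panels (a sketch, not something from the original test plan) is a cyclic Latin square: each panel hears the outlets in a rotated order, so every outlet appears in every position exactly once across the set of panels.

```python
def latin_square_orders(items):
    """Generate one presentation order per panel by cyclic rotation.
    Each item is heard first (and second, third, ...) by exactly one
    panel -- a simple counterbalancing scheme against order bias."""
    n = len(items)
    return [[items[(start + i) % n] for i in range(n)]
            for start in range(n)]

orders = latin_square_orders(["A", "B", "C", "D"])
for order in orders:
    print(order)
# ['A', 'B', 'C', 'D']
# ['B', 'C', 'D', 'A']
# ['C', 'D', 'A', 'B']
# ['D', 'A', 'B', 'C']
```

With fewer panels than outlets, randomly picking a subset of these rows (or simply shuffling the order per panel) is a reasonable fallback.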
I have a few additional thoughts, based entirely on my own experiences of comparing things in my system:
First, do not make rapid switches between outlets. Allow a brief pause.
Second, use short musical selections. If you are using digital, you can prep selections that are clipped to the desired length.
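If the selections are plain WAV files, Python's standard-library `wave` module is enough to clip each one to a fixed length; no audio software needed. The file names below are placeholders, and the demo synthesizes a test tone only so the example is self-contained.

```python
import math
import struct
import wave

def clip_wav(src_path, dst_path, seconds):
    """Copy the first `seconds` of a WAV file, so every selection
    in the listening test has the same short length."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(int(seconds * src.getframerate()))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # frame count is corrected on close
        dst.writeframes(frames)

# Demo: synthesize a 2-second 440 Hz mono tone, then clip it to 1 second.
rate = 44100
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(rate)
    samples = (int(20000 * math.sin(2 * math.pi * 440 * t / rate))
               for t in range(2 * rate))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

clip_wav("tone.wav", "clip.wav", 1.0)
with wave.open("clip.wav", "rb") as w:
    print(w.getnframes() / w.getframerate())  # 1.0
```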
Third, do not use the same musical selections for the entire test. I realize this may be controversial, but I have found that it's important to add new material while dropping some of the old stuff as you move through a series of comparisons. The discards can re-appear later on. My hypothesis for why this matters is that the experience of hearing a selection for the first time in the context of a test is very different from hearing it for, say, the sixth time, or even the second time. There is something about "newness." Every device under test should get the benefit of something new in the mix in addition to selections you have heard with the preceding device.