I offered a response, I think back in 2002, but perhaps I can offer a bit more insight now.
The all-important question is what "sound" are we talking about?
For my response, I'll assume you're implying "the absolute sound" (tas), the sound of unamplified music in a recording or concert hall space.
1) It should be quite rare to find any 'receiver' whose sonic performance can match a high-quality amp and preamp. Receivers, and perhaps some-to-many multi-ch integrated amps, are better known for convenience and for being a jack of all trades and master of none. Some of these receivers are so loaded with internal parts that I'd guess a receiver might have to retail at $20k or more to match the internal component and build quality of some better 2-channel amps.
That is probably the most reasonable and easily justifiable explanation why multi-channel systems would struggle to sonically compete with 2-channel.
2) In the same vein as no. 1 above, one would have to budget for a similar quality of interconnects, speaker cables, and speakers to maintain an apples-to-apples comparison in sound. In other words, if you budget say $10k for interconnects, speaker cables, and speakers for a 2-channel system, you'd need to multiply that $10k a number of times over to maintain the same level of sonic quality and musicality in a multi-channel system. Unless somebody is willing to follow this methodology, it's impossible to do an apples-to-apples comparison. And since people are more apt to compromise performance to meet a budget when going the multi-channel route, the multi-channel version will suffer compared to what they might own or purchase for just a 2-channel system.
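To put rough numbers on point 2, here is a back-of-envelope sketch. It assumes cost scales linearly per channel, which is my simplification and not a rule anybody has established, and the function name `matched_budget` is just something made up for illustration. It also ignores the subwoofer and any shared costs.

```python
def matched_budget(two_channel_budget: float, channels: int) -> float:
    """Scale a 2-channel cable/speaker budget to `channels` channels.

    Assumption (hypothetical): every channel costs the same, so the
    per-channel cost is simply half of the 2-channel budget.
    """
    if channels < 2:
        raise ValueError("expected at least 2 channels")
    per_channel = two_channel_budget / 2
    return per_channel * channels


# Compare a 2-channel setup with 5- and 7-channel layouts.
for n in (2, 5, 7):
    print(f"{n} channels: ${matched_budget(10_000, n):,.0f}")
```

Under that assumption, a $10k 2-channel budget for interconnects, speaker cables, and speakers becomes $25k for five channels and $35k for seven, before touching the amplification.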
3) Getting back to the absolute sound as the goal to strive for, it is no secret that some-to-many with a passion for live music and high-end freely confess that we are lucky if even our very best (2-ch) playback systems can capture 15% or at most 20% of the magic or believability of the live performance. Some think even this 15 or 20% is too high.
For the sake of argument, assume that 100% of the live performance is embedded in the recording medium and the source (the music server, cd player, or turntable) is also able to retrieve 100% of the information embedded in the recording. Obviously, one or both are potentially big assumptions.
But if by chance one or both of those assumptions are relatively accurate, that would imply one of two things. Either our components (including ics, scs, and speakers) are only processing a fraction of the information, or they are processing the vast majority or even 100% of it but smearing or blurring the signal along the way, raising the noise floor to the point where much of the information drops into it and becomes inaudible.
Since the components, ics, scs, and speakers in a 2-ch system have blurred or smeared much of the signal to the point where the vast majority of the music information is inaudible (mostly low-level detail and some high-level detail), it stands to reason that even if all other things are equal, the mere number of extra components, cables, and speakers in a multi-channel system would induce more distortion (less music) simply because of the added hardware. And since it is unreasonable or even impossible for any component, cable, or speaker to be truly neutral, this is a very viable possibility.
The problem with multi-channel is that some-to-many assume that introducing more speakers translates to more audible music information. There's simply no truth to that old wives' tale. It's roughly the same audible information as the 2-channel (15% or 20% at most of the live performance), but now spread across more than just two speakers. This should be true even if the sound engineer placed 1000 carefully positioned recording mics throughout the recording hall and the consumer had 1000 speakers to reproduce the recording in a listening room.
Additionally, if one owns a multi-channel system, it's not uncommon to start playing with the DSP modes and features to add a false sense of ambient information. No matter how you look at it, that is a further distortion of the original signal: it is not actually retrieving or making audible more of the recorded information, it's simply altering the audible portions of it.
I think all 3 explanations hold water, but in the bigger scheme of things, I think number 3 is the most significant reason why multi-channel, as impressive and fun as it might sound, at best simply cannot retrieve any more information than a 2-channel system. At worst, multi-channel adds more distortion.
To an untrained ear the multi-channel could easily sound more impressive or 'real' but that does not mean it actually is more realistic sounding.
-IMO