@faustuss
Now that you have educated yourself, you may be able to appreciate that when we say that a transport delivers Direct Stream Digital (DSD) natively over High-Definition Multimedia Interface (HDMI), we really mean it!
DSD
To be absolutely clear, DSD is a brilliantly simple way of encoding sound waves. Each successive bit indicates whether to notch the digital sound pressure level up or down by one quantum, depending on how it compares with the real analogue sound pressure level. Silence is represented by an endless series of (up, down) pairs at megahertz rates. Rising sound pressure will have more ups than downs, while falling sound pressure has more downs than ups. The analogue-to-digital conversion should not be more than half a quantum out.
Moreover, the quantisation noise can be removed by passing the bit stream through a gentle low-pass filter rolling off in the MHz region. In principle, the output can be played as an analogue signal without further processing except for a volume control!
Conventionally geeks refer to ups as 1 and downs as 0 in the digital domain.
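Here is a rough Python sketch of the up/down scheme just described, in case it helps to see it run. It is only the simple staircase tracker from the paragraph above, not a production encoder (real DSD modulators are more elaborate), and the names encode_updown, decode_updown and the quantum of 0.01 are made up for illustration.

```python
def encode_updown(samples, quantum):
    """Emit 1 (up) or 0 (down) per input sample by tracking it with a staircase."""
    level = 0.0
    bits = []
    for s in samples:
        if s >= level:
            level += quantum   # staircase is at or below the signal: step up
            bits.append(1)
        else:
            level -= quantum   # staircase is above the signal: step down
            bits.append(0)
    return bits

def decode_updown(bits, quantum):
    """Rebuild the staircase by adding or subtracting one quantum per bit."""
    level = 0.0
    out = []
    for b in bits:
        level += quantum if b else -quantum
        out.append(level)
    return out

silence = encode_updown([0.0] * 8, quantum=0.01)
print(silence)                       # [1, 0, 1, 0, 1, 0, 1, 0]

rising = encode_updown([0.005 * n for n in range(100)], quantum=0.01)
falling = encode_updown([-0.005 * n for n in range(100)], quantum=0.01)
print(sum(rising), sum(falling))     # ups outnumber downs when rising, and vice versa
```

Silence comes out as the alternating (up, down) pattern, and a rising ramp produces more 1s than 0s.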
PCM
By contrast, Pulse Code Modulation (PCM) samples the sound pressure level 44,100 times per second (using CDs as an example). At each sample, it measures the sound pressure level on a linear scale from about -32,000 to +32,000. This range can be encoded into 16 binary bits (0 or 1), where each bit represents twice the level of the previous bit.
PCM cannot encode frequencies higher than half the sampling rate, and the low-pass filters needed to remove digital noise must operate close to audible high frequencies.
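To make the half-the-sampling-rate limit concrete, here is a small Python sketch that samples two sine tones at the CD rate and rounds them to 16-bit levels. The function name sample_pcm and the test frequencies are my own choices for illustration.

```python
import math

SAMPLE_RATE = 44_100      # CD sampling rate in Hz
FULL_SCALE = 32_767       # largest positive 16-bit level

def sample_pcm(freq_hz, n_samples):
    """Sample a full-scale sine tone and round each value to a 16-bit level."""
    return [
        round(FULL_SCALE * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(n_samples)
    ]

print(SAMPLE_RATE / 2)            # 22050.0, half the sampling rate

low = sample_pcm(1_000, 8)        # a 1 kHz tone, comfortably below the limit
high = sample_pcm(43_100, 8)      # a tone far above the limit
# To within rounding, the high tone lands on the same levels as the
# 1 kHz tone with the sign flipped, so the two cannot be told apart.
print(low)
print(high)
```

Once sampled, the 43.1 kHz tone is indistinguishable from an inverted 1 kHz tone, which is why anything above half the sampling rate has to be filtered out before conversion.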
Binary
You can see how this works by counting in binary using your fingers. Using only the four fingers on your left hand, you can count from decimal 0 to 15. The first finger represents 0 or 1, the next 0 or 2, then 0 or 4, etc. You add up all the fingers to get the decimal equivalent. Keep going with the four fingers on your right hand and you get anywhere from decimal 0 to 255. (This is the range of one computer byte, and looks like part of a four-byte Internet Protocol version 4 address, e.g. 192.255.1.201.)
Add in four toes from each foot, giving you 16 bits, and you can count from 0 to 65535. Or -32768 to +32767 if one finger represents the - sign (using the two's complement scheme that CDs use), which is handy because sound waves go both up and down from silence (0). Why didn’t our Neanderthal relatives invent counting this way?
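If you would rather let a computer do the finger counting, this short Python sketch adds up the digit weights. The helper names fingers_to_int and as_signed are invented here, and the signed reading assumes two's complement, where the top digit counts as -32768 rather than +32768.

```python
def fingers_to_int(fingers):
    """fingers[0] is worth 1, fingers[1] is worth 2, fingers[2] is worth 4, and so on."""
    return sum(bit << position for position, bit in enumerate(fingers))

def as_signed(value, bits=16):
    """Two's complement reading: the top digit carries weight -(2 ** (bits - 1))."""
    sign_weight = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_weight else value

print(fingers_to_int([1, 1, 1, 1]))            # 15    (one hand)
print(fingers_to_int([1] * 8))                 # 255   (both hands)
print(fingers_to_int([1] * 16))                # 65535 (hands and toes)
print(as_signed(0x7FFF), as_signed(0x8000))    # 32767 -32768
```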
PCM linearity
So what is the issue with PCM? Well, in a 16-bit sample the most significant bit represents a value 32768 times bigger than the least significant bit. At some point, when the sound pressure rises by 1, all fifteen lower bits switch off and the biggest one switches on. It is almost impossible to manufacture any analogue device, like a resistor, to this degree of accuracy.
Imagine feeding every number between -32768 and +32767 in sequence to a Digital to Analogue Converter (DAC). Ideally the output would increase with every sample, which is known as a monotonic increase. But resistors cannot be made accurate enough.
Philips knew this from the get-go, and just used the top 14 bits in their early CD players. It gets even more ridiculous with high resolution at 24 bits, which needs 256 times more accuracy still, or 32 bits which needs 65,536 times the accuracy in order to resolve every bit.
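A toy Python model shows how little error it takes to break monotonicity. It treats the DAC as a plain sum of sixteen bit weights and makes only the biggest weight 0.01% too small. Both the model and the error figure are assumptions for illustration, not measurements of any real chip.

```python
def dac_output(code, weights):
    """Sum the analogue weight of every bit that is switched on in the code."""
    return sum(w for position, w in enumerate(weights) if (code >> position) & 1)

BITS = 16
ideal = [float(1 << p) for p in range(BITS)]   # 1, 2, 4, ... 32768

# Assumption for illustration: only the biggest weight is off, by 0.01 %.
skewed = list(ideal)
skewed[-1] *= 1 - 0.0001

below = 0x7FFF   # all fifteen lower bits on
above = 0x8000   # only the biggest bit on, one step higher

print(dac_output(above, ideal) - dac_output(below, ideal))     # 1.0
print(dac_output(above, skewed) - dac_output(below, skewed))   # about -2.28
```

With perfect weights the output rises by one step. With the skewed weight it falls by more than two steps even though the code went up by one, so the converter is no longer monotonic.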
Sigma-Delta
Enter the sigma-delta technique. If you take the difference between consecutive PCM levels, you can in theory add or subtract that many quanta to get from one sample to the next. So in effect, you convert the differences between consecutive samples to a local DSD stream. It is easier to add a number of near-identical charges than to accurately trim a resistor.
It is similarly easy to create a PCM sequence from a DSD stream – just count the net number of ups and downs over 64 bits (for DSD64) and add or subtract from the previous sample. But you cannot go the other way without guessing where each up or down should fit in time. You lose information. This is why a DAC that handles a DSD stream natively is better than a DAC chip that needs DSD to be externally converted to PCM.
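The counting idea can be sketched in a few lines of Python. This is only the naive net-count described above (real converters filter the stream rather than just counting), and dsd_to_pcm is a made-up name.

```python
def dsd_to_pcm(bits, block=64, start=0):
    """Turn each full block of up/down bits into one PCM level via the net count."""
    level = start
    pcm = []
    for i in range(0, len(bits) - block + 1, block):
        chunk = bits[i:i + block]
        ups = sum(chunk)
        level += ups - (block - ups)   # net ups minus downs, in quanta
        pcm.append(level)
    return pcm

ups_first = [1] * 40 + [0] * 24      # rises early, then falls back
downs_first = [0] * 24 + [1] * 40    # falls early, then rises

print(dsd_to_pcm(ups_first))         # [16]
print(dsd_to_pcm(downs_first))       # [16]
```

Two different 64-bit streams collapse to the same PCM level, so the order of the ups and downs within the block cannot be recovered from the PCM side.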
Transporting digital formats
Any digital format can be chunked up into packets. The packets can be any size.
If you want to send, say, a DSD stream over Inter-IC Sound (I2S), just chunk it up into 16-bit chunks and push each chunk into the pipeline. The receiving end has to know what format is being delivered so it can unpack the chunks into a new stream identical to the original DSD stream. Strictly speaking, the DSD stream travels inside I2S rather than over it, and nothing is lost.
The network technology can be anything that supports chunks (known as packets). Internet Protocol, Ethernet, Universal Serial Bus (USB); anything will do if it is error free and fast enough.
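The chunking round trip is easy to try in Python. The bit order used here (first bit goes in the most significant position) is an assumption for the sketch, real links define their own framing, and pack16 and unpack16 are invented names.

```python
def pack16(bits):
    """Group a bit list into 16-bit words, first bit in the most significant position."""
    if len(bits) % 16:
        raise ValueError("this sketch expects a multiple of 16 bits")
    words = []
    for i in range(0, len(bits), 16):
        word = 0
        for b in bits[i:i + 16]:
            word = (word << 1) | b
        words.append(word)
    return words

def unpack16(words):
    """Spread each 16-bit word back out into bits, most significant bit first."""
    return [(w >> shift) & 1 for w in words for shift in range(15, -1, -1)]

stream = [1, 0] * 16                   # 32 bits of DSD silence
words = pack16(stream)
print([hex(w) for w in words])         # ['0xaaaa', '0xaaaa']
print(unpack16(words) == stream)       # True
```

What comes out of the far end is bit-for-bit what went in, provided both ends agree on the chunk size and bit order.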
On the other hand, if the DSD stream has to be converted to the two-channel 16-bit PCM samples I2S expects, then the timing detail gets lost and most likely the conversion will not be monotonic.