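The safe way to do this is to use a stateful UTF-8 decoder, which you can get from Encoding.UTF8.GetDecoder().
A stateful decoder holds on internally to any bytes that belong to an incomplete multi-byte sequence. The next time you give it more bytes, it completes the sequence and returns the characters decoded from it.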
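Here is an example of how to use it. In my implementation I use a char[] buffer that is large enough to guarantee room for the full conversion of X bytes (one read's worth of the byte buffer). That way we allocate only two buffers to read the whole stream, apart from whatever the StringBuilder allocates as it grows.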
    using System.IO;
    using System.Text;

    public static string ReadStringFromStream(Stream stream)
    {
        // --- Byte-oriented state ---
        // A nice big buffer for us to use to read from the stream.
        byte[] byteBuffer = new byte[8192];

        // --- Char-oriented state ---
        // Gets a stateful UTF-8 decoder that holds onto unused bytes when multi-byte
        // sequences are split across multiple byte buffers.
        var decoder = Encoding.UTF8.GetDecoder();

        // Initialize a char buffer, and make it large enough that it will be able to fit
        // a full read's worth of data from the byte buffer without needing to be resized.
        char[] charBuffer = new char[Encoding.UTF8.GetMaxCharCount(byteBuffer.Length)];

        // --- Output ---
        StringBuilder stringBuilder = new StringBuilder();

        // --- Working state ---
        int bytesRead;
        int charsConverted;
        bool lastRead;

        do
        {
            // Read a chunk of bytes from our stream.
            bytesRead = stream.Read(byteBuffer, 0, byteBuffer.Length);

            // If we read 0 bytes, we hit the end of the stream.
            // We tell the decoder to flush, and then we stop.
            lastRead = (bytesRead == 0);

            // Convert the bytes into characters, flushing if this is our last conversion.
            charsConverted = decoder.GetChars(
                byteBuffer,
                0,
                bytesRead,
                charBuffer,
                0,
                lastRead
            );

            // Append the decoded characters to the output.
            stringBuilder.Append(charBuffer, 0, charsConverted);
        }
        while (!lastRead);

        return stringBuilder.ToString();
    }
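
To see the buffering behaviour in isolation, here is a minimal sketch that feeds the decoder a two-byte character split across two chunks. The class name SplitSequenceDemo and the helper DecodeInChunks are made up for this illustration; only Encoding.UTF8.GetDecoder(), Decoder.GetChars and Encoding.UTF8.GetString are real framework calls.

```csharp
using System;
using System.Text;

public static class SplitSequenceDemo
{
    // Hypothetical helper: decodes the chunks in order with ONE stateful decoder,
    // flushing only on the final chunk.
    public static string DecodeInChunks(params byte[][] chunks)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var result = new StringBuilder();

        for (int i = 0; i < chunks.Length; i++)
        {
            byte[] chunk = chunks[i];
            bool isLast = (i == chunks.Length - 1);

            // Large enough for any decoding of this chunk plus leftover decoder state.
            char[] chars = new char[Encoding.UTF8.GetMaxCharCount(chunk.Length)];

            int count = decoder.GetChars(chunk, 0, chunk.Length, chars, 0, isLast);
            result.Append(chars, 0, count);
        }

        return result.ToString();
    }

    public static void Main()
    {
        // "é" (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
        byte[] first = { 0x61, 0xC3 };   // 'a' followed by the lead byte of "é"
        byte[] second = { 0xA9, 0x62 };  // the trailing byte of "é" followed by 'b'

        // Stateful decoding: the dangling 0xC3 is kept until 0xA9 arrives.
        string stateful = DecodeInChunks(first, second);
        Console.WriteLine(stateful == "a\u00E9b");

        // Stateless decoding of each chunk on its own cannot join the two halves,
        // so the result contains the replacement character U+FFFD.
        string naive = Encoding.UTF8.GetString(first) + Encoding.UTF8.GetString(second);
        Console.WriteLine(naive.Contains("\uFFFD"));
    }
}
```

Both checks should print True: the stateful decoder reassembles the split character, while decoding each chunk independently does not.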
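Anything that splits a code point across an 8192-byte boundary will fail, yes. Why decode as UTF-8 only to re-encode it immediately? – Ryan
No, it is not safe. A better way is 'accumulator = new StreamReader(stream, Encoding.UTF8).ReadToEnd()' –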