C#: How to read an ANSI-encoded file containing special characters
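Notice: this page is a translation of a popular StackOverflow question, licensed under CC BY-SA 4.0. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow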
Original URL: http://stackoverflow.com/questions/1432064/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to read an ANSI encoded file containing special characters
Asked by Enyra
I'm writing a TFS check-in policy that checks whether our source files contain our file header.
My problem is that our file header contains a special character, "©", and unfortunately some of our source files are encoded in ANSI. So if I read these files in the policy, the string looks like this: "Copyright � 2009".
string content = File.ReadAllText(pendingChange.LocalItem);
I tried to change the encoding of the string, but it does not help. So how can I read these files so that I get the correct string "Copyright © 2009"?
Accepted answer by Jon Skeet
Use Encoding.Default:
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
You should be aware, however, that this reads the file using the system default encoding, which may not be the same as the encoding of the file. There's no single encoding called ANSI, but usually when people talk about "the ANSI encoding" they mean Windows Code Page 1252 or whatever their box happens to use.
Your code will be more robust if you can find out the exact encoding used.
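As a minimal sketch of the "exact encoding" advice, the following assumes the files were saved as Windows-1252; that code page, the class name AnsiReader, and its method names are illustrative assumptions and not part of the original answer. Note that on .NET Core and .NET 5+ Encoding.Default is UTF-8, so naming the code page explicitly is also the more portable option there.

```csharp
using System.IO;
using System.Text;

public static class AnsiReader
{
    // Assumption: the source files were saved as Windows-1252.
    // Substitute the real code page if your files use a different one.
    private const int AssumedCodePage = 1252;

    public static Encoding GetAnsiEncoding()
    {
        // Needed on .NET Core / .NET 5+, where code-page encodings are not
        // registered by default. On .NET Framework this call requires the
        // System.Text.Encoding.CodePages package, or can simply be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(AssumedCodePage);
    }

    public static string ReadAnsiFile(string path)
    {
        // Decode the file with the explicitly named code page rather than
        // whatever the machine's default happens to be.
        return File.ReadAllText(path, GetAnsiEncoding());
    }
}
```

In the check-in policy this would be called as AnsiReader.ReadAnsiFile(pendingChange.LocalItem), which gives the same result on every machine regardless of its regional settings.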
Answer by AnthonyWJones
It would seem sensible, if you are going to have such policies, that you would also have a team-agreed standard encoding. To be honest, I can't see why any team would use an encoding other than "Unicode (UTF-8 with signature) - Codepage 65001" (except perhaps for ASPX pages with significant non-Latin static content, but even then I can't see how it would be a big deal to use UTF-8).
Assuming you still want to allow mixed encodings, you next need a way to determine which encoding a file was saved in, so you know which encoding to pass to ReadAllText. It's not easy to determine this from the file; however, using Encoding.Default is likely to work OK, since most likely you have just 2 encodings to deal with: the VS one (UTF-8 with signature) and a common ANSI encoding used by your machines (probably Windows-1252).
Hence using
string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);
will work. (As I see Jon has already posted). This works because when the UTF-8 BOM (which is what VS means by the term "signature") is present at the start of the file the supplied encoding parameter is ignored and UTF-8 is used anyway. Hence where the file is saved using UTF-8 you get correct results and where ANSI is used you are most likely also to get correct results.
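The BOM behaviour described above can be checked with a small experiment. This is only a sketch: the class BomFallbackDemo and its methods are hypothetical names, and it uses Windows-1252 explicitly instead of Encoding.Default so that the result does not depend on the machine (that substitution is an assumption, not what the answer wrote).

```csharp
using System.IO;
using System.Text;

public static class BomFallbackDemo
{
    public static string ReadWithAnsiFallback(string path, Encoding ansiFallback)
    {
        // File.ReadAllText looks for a byte order mark first and only falls
        // back to the supplied encoding when no BOM is present.
        return File.ReadAllText(path, ansiFallback);
    }

    public static bool BothFilesReadCorrectly()
    {
        const string header = "Copyright \u00A9 2009";

        // Required on .NET Core / .NET 5+ before code page 1252 can be used.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252); // assumed ANSI code page

        string utf8Path = Path.GetTempFileName();
        string ansiPath = Path.GetTempFileName();
        try
        {
            // One file saved as "UTF-8 with signature" (BOM), one as ANSI.
            File.WriteAllText(utf8Path, header, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));
            File.WriteAllText(ansiPath, header, ansi);

            // Both are read through the same call with the same fallback.
            return ReadWithAnsiFallback(utf8Path, ansi) == header
                && ReadWithAnsiFallback(ansiPath, ansi) == header;
        }
        finally
        {
            File.Delete(utf8Path);
            File.Delete(ansiPath);
        }
    }
}
```

The caveat is UTF-8 saved without a signature: with no BOM to detect, such a file is decoded with the ANSI fallback and any non-ASCII characters come out wrong, which is one more reason to agree on a single team encoding.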
BTW, if you are processing file headers, wouldn't ReadAllLines make things easier?
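A header check built on ReadAllLines might look roughly like the sketch below. The expected header text, the number of lines inspected, and the HeaderCheck class are all hypothetical placeholders; the real policy would use its own header definition.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class HeaderCheck
{
    // Hypothetical values for illustration only.
    private const string ExpectedHeaderText = "Copyright \u00A9 2009";
    private const int HeaderLineCount = 5;

    public static bool HasFileHeader(string path, Encoding fallbackEncoding)
    {
        // ReadAllLines also honours a BOM if one is present and otherwise
        // decodes with the supplied fallback encoding.
        return File.ReadAllLines(path, fallbackEncoding)
            .Take(HeaderLineCount)                           // only inspect the top of the file
            .Any(line => line.Contains(ExpectedHeaderText)); // ordinal substring match
    }
}
```

Called as HeaderCheck.HasFileHeader(pendingChange.LocalItem, Encoding.Default), this keeps the comparison limited to the first few lines instead of searching the whole file content.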