C# Insert UTF-8 data into SQL Server 2008

Question

Asked by Aaginor

I have an issue with encoding. I want to put data from a UTF-8-encoded file into a SQL Server 2008 database. SQL Server only features UCS-2 encoding, so I decided to explicitly convert the retrieved data.

// connect to page file
_fsPage = new FileStream(mySettings.filePage, FileMode.Open, FileAccess.Read);
_streamPage = new StreamReader(_fsPage, System.Text.Encoding.UTF8);

Here's the conversion routine for the data:

private string ConvertTitle(string title)
{
  string utf8_String = Regex.Replace(Regex.Replace(title, @"\.", _myEvaluator), @"(?<=[^\\])_", " ");
  byte[] utf8_bytes = System.Text.Encoding.UTF8.GetBytes(utf8_String);
  byte[] ucs2_bytes = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.Unicode, utf8_bytes);
  string ucs2_String = System.Text.Encoding.Unicode.GetString(ucs2_bytes);

  return ucs2_String;
}

When stepping through the code for critical titles, the variable watch shows the correct characters for both the UTF-8 and the UCS-2 string. But in the database the result is partially wrong: some special characters are saved correctly, others are not.

Wrong: ń becomes an n
Right: é or é are for example inserted correctly.

Any idea where the problem might be and how to solve it?

Thanks in advance, Frank

Answer 1

Accepted answer by bobince

I think you have a misunderstanding of what encodings are. An encoding is used to convert a bunch of bytes into a character string. A String does not itself have an encoding associated with it.

Internally, Strings are stored in memory as UTF-16LE bytes (which is why Windows persists in confusing everyone by calling the UTF-16LE encoding just “Unicode”). But you don't need to know that; to you, they're just strings of characters.

What your function does is:

1. Takes a string and converts it to UTF-8 bytes.
2. Takes those UTF-8 bytes and converts them to UTF-16LE bytes. (You could have just encoded straight to UTF-16LE instead of UTF-8 in step one.)
3. Takes those UTF-16LE bytes and converts them back to a string. This gives you the exact same String you had in the first place!
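
The three steps can be checked with a small sketch. The class and method names (`RoundTripDemo`, `RoundTrip`) and the sample title are made up for illustration; only the `System.Text.Encoding` calls come from the question's code. For any well-formed string, the result should compare equal to the input:

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper mirroring the three steps:
    // string -> UTF-8 bytes -> UTF-16LE bytes -> string.
    public static string RoundTrip(string input)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
        return Encoding.Unicode.GetString(utf16Bytes);
    }

    public static void Main()
    {
        // Illustrative sample containing both ń (U+0144) and é (U+00E9).
        string title = "Gda\u0144sk caf\u00E9";
        Console.WriteLine(RoundTrip(title) == title);
    }
}
```

Since the output equals the input, the conversion adds nothing before the string is handed to the database.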

So this function is redundant; you can actually just pass a normal String to SQL Server from .NET and not worry about it.

The bit with the backslashes does do something, presumably something application-specific; I don't understand what it's for. But nothing in that function will cause Windows to flatten characters like ń to n.

What /will/ cause that kind of flattening is when you try to put characters that aren't in the database's own encoding in the database. Presumably é is OK because that character is in your default encoding of cp1252 Western European, but ń is not, so it gets mangled.
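
Whether a given character exists in code page 1252 can be checked from C#. This is only a rough sketch: the helper `FitsInCodePage` is hypothetical, it says nothing about which collation the database actually uses, and it assumes a modern .NET runtime where the code-page encodings must be registered first (on .NET Framework the `RegisterProvider` line is not needed). It encodes with an exception fallback, so an unrepresentable character is reported instead of being silently replaced:

```csharp
using System;
using System.Text;

public static class CodePageCheck
{
    // Hypothetical helper: true if every character of text can be
    // encoded in the given Windows code page without substitution.
    public static bool FitsInCodePage(string text, int codePage)
    {
        Encoding strict = Encoding.GetEncoding(
            codePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no mapping in this code page.
            return false;
        }
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Console.WriteLine(FitsInCodePage("\u00E9", 1252)); // é
        Console.WriteLine(FitsInCodePage("\u0144", 1252)); // ń
    }
}
```

Under these assumptions the check should succeed for é and fail for ń, which matches the behaviour seen in the database.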

SQL Server does use 'UCS2' (really UTF-16LE again) to store Unicode strings, but you have to tell it to, typically by using a NATIONAL CHARACTER (NCHAR/NVARCHAR) column type instead of plain CHAR.
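
On the C# side, passing the string unchanged might look like the sketch below. It assumes the `System.Data.SqlClient` provider, a table `tblTest` with an NVARCHAR column `test` (placeholder names), and a caller-supplied connection string; none of these are confirmed by the question. Declaring the parameter as `SqlDbType.NVarChar` sends the value as Unicode:

```csharp
using System.Data;
using System.Data.SqlClient;

public static class UnicodeInsert
{
    // Builds a parameterized INSERT; the table and column names are placeholders.
    public static SqlCommand BuildInsert(SqlConnection connection, string value)
    {
        var command = new SqlCommand(
            "INSERT INTO tblTest (test) VALUES (@test)", connection);
        // NVarChar marks the parameter as Unicode, so the .NET string
        // is passed through without any manual byte conversion.
        command.Parameters.Add("@test", SqlDbType.NVarChar).Value = value;
        return command;
    }

    // Opens the connection and runs the insert. The connection string
    // is supplied by the caller and is an assumption of this sketch.
    public static void Insert(string connectionString, string value)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = BuildInsert(connection, value))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```

If the target column is still CHAR or VARCHAR, the server will convert the value to the column's code page on storage, so the column type has to change as well.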

Answer 2

Answered by CraftyFella

We were also very confused about encoding. Here is a useful page that explains it. Also, the answers to the following SO question will help to explain it too:

In C# String/Character Encoding what is the difference between GetBytes(), GetString() and Convert()?

Answer 3

Answered by Chris Chadwick

SQL Server 2008 handles the conversion from UTF-8 into UCS-2 for you.

First, make sure your SQL tables use the nchar or nvarchar data types for the columns. Then you need to tell SQL Server that you're sending in Unicode data by adding an N in front of the encoded string.

INSERT INTO tblTest (test) VALUES (N'EncodedString')

From Microsoft: http://support.microsoft.com/kb/239530

See my question and solution here: How do I convert UTF-8 data from Classic asp Form post to UCS-2 for inserting into SQL Server 2008 r2?

Answer 4

Answered by Charles Burns

For future readers using newer releases, note that SQL Server 2016 supports UTF-8 in its bcp utility.
