Linux UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

Question

Asked by transilvlad

I have a socket server that is supposed to receive valid UTF-8 characters from clients.

The problem is that some clients (mainly hackers) are sending all the wrong kinds of data over it.

I can easily distinguish genuine clients, but I log all the data sent to files so I can analyze it later.

Sometimes I get characters like this œ that cause the UnicodeDecodeError error.

I need to be able to make the string UTF-8 with or without those characters.

Update:

For my particular case the socket service was an MTA and thus I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <[email protected]>
...

I was logging all of this in JSON.

Then some folks out there without good intentions decided to send all kinds of junk.

That is why, for my specific case, it is perfectly OK to strip the non-ASCII characters.

Answer 1

Accepted answer by transilvlad

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: errors='ignore' strips out the characters in question and returns the string without them; errors='replace' substitutes the replacement character U+FFFD for each of them instead.

For me this is the ideal case, since I'm using it as protection against non-ASCII input, which my application does not allow.
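
On Python 3 the unicode built-in no longer exists, so the same idea can be sketched with bytes.decode. This is only a minimal sketch, not the author's code: the function name sanitize_line and the sample bytes are made up for illustration.

```python
def sanitize_line(raw: bytes, strip: bool = True) -> str:
    """Decode bytes received from a client without ever raising.

    strip=True drops undecodable bytes (errors='ignore');
    strip=False substitutes U+FFFD for each one (errors='replace').
    """
    handler = "ignore" if strip else "replace"
    return raw.decode("utf-8", errors=handler)


# Hypothetical input: an SMTP command with a stray 0x9c byte in it.
raw = b"EHLO \x9cexample.com"
stripped = sanitize_line(raw)
replaced = sanitize_line(raw, strip=False)

# If only ASCII is expected, decoding as ASCII drops every other byte.
ascii_only = raw.decode("ascii", errors="ignore")
```

Which handler fits depends on whether the log should show that something was removed; 'replace' leaves a visible marker, 'ignore' does not.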

Alternatively: Use the open function from the codecs module to read in the file:

import codecs

# file_name is the path of the file to read; undecodable bytes are dropped
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    data = fdata.read()

Answer 2

Answered by Ignacio Vazquez-Abrams

>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
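
The session above is Python 2 and shows that 0x9c is the letter œ in Windows-1252. On Python 3, one way to apply that observation is to try UTF-8 first and fall back to cp1252. The sketch below assumes the stray bytes really come from a Windows-1252 source, which the question does not confirm; decode_with_fallback is a hypothetical helper name.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Try UTF-8 first; fall back to cp1252 for legacy Windows bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes (such as 0x81) undefined, so
        # 'replace' keeps this branch from raising as well.
        return raw.decode("cp1252", errors="replace")


single = decode_with_fallback(b"\x9c")                 # not valid UTF-8
utf8_text = decode_with_fallback("œ".encode("utf-8"))  # valid UTF-8
```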

Answer 3

Answered by workplaylifecycle

Just in case someone has the same problem: I'm using vim with YouCompleteMe, and ycmd failed to start with this error message. What I did was run export LC_CTYPE="en_US.UTF-8", and the problem went away.
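
To check whether the locale is involved, it can help to print the encodings Python has picked up from the environment. This is a small diagnostic sketch, not part of the original answer; the values printed depend entirely on the machine it runs on.

```python
import locale
import sys

# Encoding Python derives from the locale settings (LC_ALL, LC_CTYPE, LANG).
preferred = locale.getpreferredencoding(False)
# Encoding used to convert file names to and from bytes.
fs_encoding = sys.getfilesystemencoding()
# Encoding used when writing text to standard output.
stdout_encoding = sys.stdout.encoding

print(preferred, fs_encoding, stdout_encoding)
```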

Answer 4

Answered by James McCormac

This type of issue crops up for me now that I've moved to Python 3. I had no idea Python 2 was simply steamrolling any issues with file encoding.

I found this nice explanation of the differences and how to find a solution after none of the above worked for me.

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

In short, to make Python 3 behave as similarly as possible to Python 2, use:

with open(filename, encoding="latin-1") as datafile:
    # work on datafile here

However, read the article; there is no one-size-fits-all solution.
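
The reason latin-1 never raises is that it maps each of the 256 byte values to exactly one code point. The sketch below illustrates that property and its cost; the variable names are only for illustration.

```python
# Every possible byte value, including ones that are invalid in UTF-8.
all_bytes = bytes(range(256))

text = all_bytes.decode("latin-1")    # never raises
round_trip = text.encode("latin-1")   # recovers the original bytes

# The catch: real UTF-8 input is silently misread instead of rejected.
mojibake = "œ".encode("utf-8").decode("latin-1")
```

So latin-1 is safe for passing bytes through untouched, but any non-ASCII text that was actually UTF-8 will come out as the wrong characters.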

Answer 5

Answered by maiky_forrester

I had the same problem with UnicodeDecodeError and I solved it with this line. I don't know if it is the best way, but it worked for me.

str = str.decode('unicode_escape').encode('utf-8')
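
The line above is Python 2. A rough Python 3 equivalent is sketched below with a made-up sample value, mainly to show a side effect worth knowing about: unicode_escape treats plain bytes as Latin-1 and also interprets backslash escapes found in the data.

```python
# Hypothetical input: a stray 0x9c byte plus a literal backslash followed by 'n'.
raw = b"caf\x9c and a literal backslash-n: \\n"

decoded = raw.decode("unicode_escape")
utf8_bytes = decoded.encode("utf-8")

# 0x9c became the code point U+009C, and the two characters '\' 'n'
# were turned into a real newline.
```

That second effect changes the data, so this approach is probably best kept for input where backslashes cannot occur.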

Answer 6

Answered by Do?u?

Changing the engine from C to Python did the trick for me.

Engine is C:

pd.read_csv(gdp_path, sep='\t', engine='c')

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

Engine is Python:

pd.read_csv(gdp_path, sep='\t', engine='python')

No errors for me.
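
Byte 0x92 is the right single quotation mark in Windows-1252, so a file that triggers this message may simply be cp1252-encoded. The sketch below uses only the standard library and a made-up tab-separated sample to show the failure and one possible fix; that the original file was cp1252 is an assumption. read_csv also accepts an encoding argument, so passing encoding='cp1252' there would be the analogous change.

```python
import csv
import io

# Hypothetical sample: a tab-separated file containing byte 0x92.
raw = b"country\tnote\nFrance\tit\x92s fine\n"


def read_tsv(raw_bytes, encoding):
    """Decode the bytes with the given encoding and parse them as TSV."""
    text = raw_bytes.decode(encoding)
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


try:
    read_tsv(raw, "utf-8")
    error_message = ""
except UnicodeDecodeError as exc:
    error_message = str(exc)

rows = read_tsv(raw, "cp1252")
```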

Answer 7

Answered by Kothapati Purandhar Reddy

What can you do if you need to make a change to a file, but don't know the file's encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()
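
What makes surrogateescape useful is that the unknown bytes survive a round trip. The sketch below shows this on an in-memory value rather than a file; the sample bytes are made up for illustration.

```python
# Hypothetical input: mostly ASCII with one byte of unknown encoding.
raw = b"EHLO \x9cexample.com\r\n"

# Each undecodable byte becomes a lone surrogate (0x9c -> U+DC9C).
text = raw.decode("ascii", errors="surrogateescape")

# The ASCII parts can be inspected or edited as normal text.
edited = text.replace("EHLO", "HELO")

# Encoding with the same handler restores the unknown bytes unchanged.
restored = edited.encode("ascii", errors="surrogateescape")
```

Note that such a string cannot be encoded as plain UTF-8 (for example when writing JSON) without using the same error handler, because lone surrogates are not valid in UTF-8.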

Answer 8

Answered by Ivan Lee

First, use get_encoding_type to detect the file's encoding:

from chardet import detect  # third-party package

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

Second, open the file with the detected encoding:

open(current_file, 'r', encoding=get_encoding_type(current_file), errors='ignore')
