Linux: How to remove non-UTF-8 characters from a text file

Disclaimer: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/12999651/

Time: 2020-08-06 14:39:52  Source: igfitidea

How to remove non UTF-8 characters from text file

linux bash text utf-8 character-encoding

Asked by Hakim

I have a bunch of Arabic, English, Russian files which are encoded in utf-8. Trying to process these files using a Perl script, I get this error:

Malformed UTF-8 character (fatal)

Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.

Is there any way to do it?

Answered by Charles KnNell

Your method must read the file byte by byte and correctly understand how characters are constructed from bytes. The simplest method is to use an editor that will read anything but only output valid UTF-8 characters. TextPad is one choice.
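To make the "byte-wise construction" concrete, here is a minimal shell sketch, assuming a POSIX `od` and `printf` are available. The helper name `hex_bytes` is made up for illustration. It dumps the bytes of one valid two-byte character and then of a string containing a stray continuation byte, the kind of byte a strict UTF-8 decoder would be expected to reject.

```shell
#!/bin/sh
# hex_bytes (hypothetical helper): print stdin as one run of hex digits.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# "é" (U+00E9) is two bytes in UTF-8:
# lead byte 0xC3 (110xxxxx) followed by continuation byte 0xA9 (10xxxxxx).
printf '\303\251' | hex_bytes    # prints c3a9

# Here 0xA9 appears with no lead byte in front of it, between "a" and "b".
# A continuation byte on its own is not a valid UTF-8 sequence.
printf 'a\251b' | hex_bytes      # prints 61a962
```

A cleaning tool has to recognise the second case and drop only the stray byte, leaving the `a` and `b` around it untouched.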

Answered by Palantir

This command:

iconv -f utf-8 -t utf-8 -c file.txt

will print a cleaned-up copy of your UTF-8 file to standard output, skipping all the invalid characters; redirect it to a new file to keep the result.

-f is the source format
-t the target format
-c skips any invalid sequence
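
The flags above can be combined into a small check-then-clean routine. This is only a sketch, assuming an `iconv` that behaves like the glibc one; the function names `is_valid_utf8` and `clean_utf8` and the sample file names are invented for illustration. It relies on `iconv` stopping with an error at the first invalid sequence when `-c` is not given.

```shell
#!/bin/sh
# is_valid_utf8 FILE: succeeds only if iconv can decode the whole file.
# Without -c, iconv fails at the first invalid sequence.
is_valid_utf8() {
    iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

# clean_utf8 IN OUT: write a copy of IN to OUT with invalid sequences dropped.
# The exit status is deliberately ignored in this sketch: some iconv builds
# return non-zero after dropping bytes even though the output is usable.
clean_utf8() {
    iconv -f utf-8 -t utf-8 -c "$1" > "$2" || true
}

# Demo on a throwaway sample containing a stray 0xFF byte,
# which can never appear in valid UTF-8.
tmp=$(mktemp -d)
printf 'ok \377 done\n' > "$tmp/sample.txt"

if ! is_valid_utf8 "$tmp/sample.txt"; then
    clean_utf8 "$tmp/sample.txt" "$tmp/cleaned.txt"
fi

# The 0xFF byte is gone; the surrounding text is kept.
cat "$tmp/cleaned.txt"
```

Writing to a separate output file rather than redirecting onto the input avoids truncating the original before `iconv` has read it. For a directory of files, the same two functions could be called in a `for` loop.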

Answered by atul jha

cat foo.txt | strings -n 8 > bar.txt

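will do the job, but only for plain ASCII text: by default strings keeps runs of printable 7-bit characters (here at least 8 long), so it would likely drop the Arabic and Russian text as well.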
