Linux: How to remove non-UTF-8 characters from a text file

Question

Asked by Hakim

I have a bunch of Arabic, English, Russian files which are encoded in utf-8. Trying to process these files using a Perl script, I get this error:

Malformed UTF-8 character (fatal)

Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.

Is there any way to do it?

Answer 1

Answered by Charles KnNell

Your method must read the file byte by byte and fully understand how characters are constructed from bytes. The simplest approach is to use an editor that will read anything but output only valid UTF-8 characters. TextPad is one choice.
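
If no such editor is at hand, the byte-level validation can be left to a UTF-8 decoder. The following is only a minimal sketch in Python, not part of the original answer: `clean_utf8` is a hypothetical helper name, and it assumes the whole file fits in memory. The decoder checks every byte sequence and the `errors="ignore"` handler drops whatever is not valid UTF-8.

```python
def clean_utf8(data: bytes) -> bytes:
    """Return data with every invalid UTF-8 byte sequence removed.

    clean_utf8 is a hypothetical helper name used for illustration.
    """
    # The decoder validates each byte sequence; errors="ignore" silently
    # drops the bytes that do not form a valid UTF-8 character.
    text = data.decode("utf-8", errors="ignore")
    # Re-encode so the result is bytes again, now valid UTF-8 throughout.
    return text.encode("utf-8")


if __name__ == "__main__":
    # Arabic and Russian text with one stray 0xFF byte in between.
    raw = "مرحبا".encode("utf-8") + b"\xff" + " мир".encode("utf-8")
    print(clean_utf8(raw).decode("utf-8"))
```

To clean a file, read it in binary mode (`open(path, "rb")`), pass the bytes through the function, and write the result to a new file rather than overwriting the original.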

Answer 2

Answered by Palantir

This command:

iconv -f utf-8 -t utf-8 -c file.txt

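will clean up your UTF-8 file, skipping all the invalid characters. The cleaned text is written to standard output and file.txt itself is left unchanged, so redirect the output to a new file if you want to keep it.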

-f is the source format
-t is the target format
-c skips any invalid sequence
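
Since `-c` discards invalid sequences silently, it can be useful to see where they are before removing them. The sketch below is an assumption-laden illustration in Python rather than anything from the original answer: `find_invalid_utf8` is a hypothetical helper, and it relies on the `start` and `end` attributes of Python's `UnicodeDecodeError` to locate each bad sequence.

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, offending_bytes) for each invalid UTF-8 sequence.

    find_invalid_utf8 is a hypothetical helper name used for illustration.
    """
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no further invalid bytes.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            bad.append((start, data[start:end]))
            # Resume scanning just past the invalid sequence.
            pos = end
    return bad


if __name__ == "__main__":
    sample = "привет".encode("utf-8") + b"\xfe" + b" ok"
    for offset, raw in find_invalid_utf8(sample):
        print(f"offset {offset}: {raw!r}")
```

Re-decoding the remainder after every error keeps the code short but is slow on files with many bad sequences; for a quick inspection of a few files that is usually acceptable.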

Answer 3

Answered by atul jha

cat foo.txt | strings -n 8 > bar.txt

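will do the job by keeping only runs of at least 8 printable characters. Be aware that strings looks for 7-bit ASCII by default, so this also discards the Arabic and Russian text, along with any run shorter than 8 characters.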
