Bash script to convert from HTML entities to characters

Source: Stack Overflow, http://stackoverflow.com/questions/5929492/ (retrieved 2020-08-29 08:29:53 via igfitidea).
License: this content is provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors.

Tags: html, bash, html-escape-characters

Asked by Marko

I'm looking for a way to turn this:

hello &lt; world

to this:

hello < world

I could use sed, but how can this be accomplished without using cryptic regex?

Answered by ceving

Try recode (archived page; GitHub mirror; Debian page):

$ echo '&lt;' |recode html..ascii
<

Install on Debian-based Linux systems (the command uses apt-get):

$ sudo apt-get install recode

Install on Mac OS using:

$ brew install recode

Answered by user1788934

With perl:

cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'

With php from the command line:

cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'

Answered by Whitecat

An alternative is to pipe through a web browser -- such as:

echo '&#33;' | w3m -dump -T text/html

This worked great for me in Cygwin, where downloading and installing distributions is difficult.

This answer was found here

Answered by user243

Using xmlstarlet:

echo 'hello &lt; world' | xmlstarlet unesc

Answered by WinEunuuchs2Unix

This answer is based on "Short way to escape HTML in Bash?", which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:

sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/\</g; s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; s/&rdquo;/\"/g;'

Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a Bash script that web-scrapes SE answers and compares them to local code files here: Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers

Edit June 26, 2017

Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K-line file from Ask Ubuntu / Stack Exchange, so I switched to Bash's built-in search and replace, which brought the response time down to ~1 second.

Here's the function:

#-------------------------------------------------------------------------------
LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g; 
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; 
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&amp;/&}"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()
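
One caveat with ordered replacement lists like the two above: because &amp; is replaced before the other entities, doubly escaped input such as &amp;lt; ends up as < instead of &lt;. The following is a minimal sketch, not the author's code, that handles &amp; last to avoid this. The function name decode_basic_entities is made up for illustration, and only a handful of entities are covered.

```shell
# Sketch (assumed helper name): decode a few common entities with sed.
# &amp; is handled last so that "&amp;lt;" becomes "&lt;", not "<".
decode_basic_entities() {
    sed 's/&nbsp;/ /g
         s/&lt;/</g
         s/&gt;/>/g
         s/&quot;/"/g
         s/&#39;/'"'"'/g
         s/&amp;/\&/g'
}

# Usage: reads standard input, writes the decoded text to standard output.
printf '%s\n' 'a &amp;lt; b &lt; c' | decode_basic_entities
```

The same ordering idea applies to the Bash built-in version: moving the &amp; replacement to the end of the function would have the same effect.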

Answered by Aissen

A Python 3.4+ version (html.unescape was added in Python 3.4):

cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'

Answered by unagi

Supporting the unescaping of all HTML entities with sed substitutions alone would require an impractically long list of commands, because every Unicode code point has at least two corresponding HTML entities (a decimal and a hexadecimal numeric reference).

But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):

#!/bin/sh

htmlEscDec2Hex() {
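    file=$1
    # If no readable file was given, buffer standard input in a temporary file.
    [ ! -r "$file" ] && file=$(mktemp) && cat >"$file"

    # Turn each &#D; into a "&#x%x;" format directive and let printf fill in
    # the decimal values extracted by grep.
    printf -- \
        "$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
        $(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')

    # Remove the temporary file if one was created.
    [ x"$1" != x"$file" ] && rm -f -- "$file"
}

htmlHexUnescape() {
    # Pad each &#xH; with zeros, then rewrite it as \uHHHH or \UHHHHHHHH
    # so that printf prints the corresponding character.
    printf -- "$(
        sed 's/\\/\\\\/g;s/%/%%/g
            ;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
            ;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
            ;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}

htmlEscDec2Hex "$1" | htmlHexUnescape \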
    | sed -f named_entities.sed

Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility's. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.

This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:

<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
  const referenceURL = 'https://html.spec.whatwg.org/entities.json';

  function writeln(element, text) {
    element.appendChild( document.createTextNode(text) );
    element.appendChild( document.createElement("br") );
  }

  (async function(container) {
    const json = await (await fetch(referenceURL)).json();
    container.innerHTML = "";
    writeln(container, "#!/usr/bin/sed -f");
    const addLast = [];
    for (const name in json) {
      // Escape the characters that are special in a sed replacement: \ & /
      const characters = json[name].characters
        .replace("\\", "\\\\")
        .replace("&", "\\&")
        .replace("/", "\\/");
      const command = "s/" + name + "/" + characters + "/g";
      if ( name.endsWith(";") ) {
        writeln(container, command);
      } else {
        addLast.push(command);
      }
    }
    for (const command of addLast) { writeln(container, command); }
  })( document.getElementById("sed-script") );
</script>
</body></html>

Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.
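
To give an idea of what the generated file contains, here is a small hand-written stand-in for named_entities.sed covering four entities. This is a sketch under assumptions: the real file is produced by the page above and is far longer, and the names sample_script and decode_named_sample are made up for illustration.

```shell
# Hand-written excerpt standing in for the generated named_entities.sed.
# Each line is one substitution; "&" is escaped in the replacement because
# an unescaped "&" there means "the matched text" to sed.
sample_script=$(mktemp)
cat >"$sample_script" <<'EOF'
s/&lt;/</g
s/&gt;/>/g
s/&quot;/"/g
s/&amp;/\&/g
EOF

# Apply the substitutions to standard input.
decode_named_sample() {
    sed -f "$sample_script"
}

printf '%s\n' 'hello &lt; &quot;world&quot; &amp; more' | decode_named_sample
# Clean up when no longer needed: rm -f -- "$sample_script"
```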

只需在现代浏览器中打开它,然后将结果页面另存为文本文件named_entities.sed。如果只需要命名实体,这个 sed 脚本也可以单独使用;在这种情况下,给它可执行权限是很方便的,以便可以直接调用它。

Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.

For example, if for some reason the data needs to be processed in chunks (this might be the case if printf is not a shell built-in and the data to process is large), one could use it as:

nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
    do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done


Explanation of the script follows.

There are three types of escape sequences that need to be supported:

  1. &#D; where D is the decimal value of the escaped character's Unicode code point;

  2. &#xH; where H is the hexadecimal value of the escaped character's Unicode code point;

  3. &N; where N is the name of one of the named entities for the escaped character.
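
As a rough illustration of the three forms, the sketch below sorts a single reference into one of them by its shape. The function name classify_ref is hypothetical, and the check is deliberately loose: it looks only at the prefix and the trailing semicolon and does not validate digits or entity names.

```shell
# Sketch (assumed helper name): classify one character reference by shape.
# Only the prefix and the closing ";" are checked, nothing is validated.
classify_ref() {
    case $1 in
        '&#'[xX]*';') echo hexadecimal ;;
        '&#'*';')     echo decimal ;;
        '&'*';')      echo named ;;
        *)            echo 'not a reference' ;;
    esac
}

# The same character, "<", written in each of the three forms.
classify_ref '&#60;'     # prints decimal
classify_ref '&#x3c;'    # prints hexadecimal
classify_ref '&lt;'      # prints named
```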

The &N; escapes are supported by the generated named_entities.sed script, which simply performs the list of substitutions.

The central piece of this method for supporting the code point escapes is the printf utility, which is able to:

  1. print numbers in hexadecimal format, and

  2. print characters from their code point's hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).

The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.
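
The decimal-to-hexadecimal step can be seen in isolation with a single value. This is a minimal sketch, not part of the script: the function name dec_ref_to_hex_ref is made up, and it takes a bare code point rather than scanning text as htmlEscDec2Hex does.

```shell
# Sketch (assumed helper name): print the hexadecimal reference for one
# decimal code point, using printf's %x conversion.
dec_ref_to_hex_ref() {
    printf '&#x%x;\n' "$1"
}

dec_ref_to_hex_ref 60     # prints &#x3c;
dec_ref_to_hex_ref 167    # prints &#xa7;
```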

The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf's \u/\U escapes, then uses the second feature to print the unescaped characters.
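
To make the sed stage of that transformation visible, the sketch below runs only the three substitutions and prints the intermediate text, without handing it to printf. The function name hex_ref_to_printf_escape is made up for illustration. The idea is that every reference is first padded with seven zeros, after which exactly one of the two following patterns can match: the 4-digit one for code points up to FFFF, the 8-digit one otherwise.

```shell
# Sketch (assumed helper name): rewrite &#xH; references as the \uHHHH or
# \UHHHHHHHH escapes understood by some printf implementations.
hex_ref_to_printf_escape() {
    sed 's/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
         s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
         s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g'
}

printf '%s\n' '&#x3c; &#x1F600;' | hex_ref_to_printf_escape
# prints: \u003c \U0001F600
```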

Answered by Reino

With Xidel:

echo 'hello &lt; &#x3a; &quot; world' | xidel -s - -e 'parse-html($raw)'
hello < : " world