Bash script to convert HTML entities to characters

Question

Asked by Marko

I'm looking for a way to turn this:

hello &lt; world

to this:

hello < world

I could use sed, but how can this be accomplished without using cryptic regex?

Answer 1

Answered by ceving

Try recode (archived page; GitHub mirror; Debian page):

$ echo '&lt;' |recode html..ascii
<

Install on Linux and similar Unix-y systems:

$ sudo apt-get install recode

Install on Mac OS using:

$ brew install recode

Answer 2

Answered by user1788934

With perl:

cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'

With php from the command line:

cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'

Answer 3

Answered by Whitecat

An alternative is to pipe through a web browser -- such as:

echo '&#33;' | w3m -dump -T text/html

This worked great for me in Cygwin, where downloading and installing distributions is difficult.

This answer was found here

Answer 4

Answered by user243

Using xmlstarlet:

echo 'hello &lt; world' | xmlstarlet unesc

Answer 5

Answered by WinEunuuchs2Unix

This answer is based on "Short way to escape HTML in Bash?", which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:

sed 's/&nbsp;/ /g; s/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&ldquo;/"/g; s/&rdquo;/"/g; s/&amp;/\&/g'

Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a Bash script that web-scrapes SE answers and compares them to local code files; see "Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers".

Edit June 26, 2017

Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K-line file from Ask Ubuntu / Stack Exchange. As such, I was forced to switch to Bash's built-in search and replace, which brought the response time down to ~1 second.

Here's the function:

#-------------------------------------------------------------------------------
LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g;
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g;
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
    # &amp; goes last so "&amp;lt;" is not decoded twice; the backslash keeps
    # "&" literal in Bash 5.2+, where an unquoted "&" means "the matched text".
    LineOut=${LineOut//&amp;/\&}
} # HTMLtoText ()
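
The same built-in substitutions can be applied to a whole stream rather than a single line. The following is only a minimal sketch, not part of the original answer: the function name decode_stream is made up for illustration, and it handles just three entities. It assumes Bash, since ${var//pattern/replacement} is not available in plain sh.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the answer): decode a few entities on every
# line of standard input using only Bash parameter expansion.
decode_stream() {
    local line
    # "|| [[ -n $line ]]" also handles a final line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        line="${line//&lt;/<}"
        line="${line//&gt;/>}"
        # Decode &amp; last so that "&amp;lt;" becomes "&lt;", not "<".
        # The backslash keeps "&" literal in Bash 5.2 and later.
        line=${line//&amp;/\&}
        printf '%s\n' "$line"
    done
}

printf '%s\n' 'hello &lt; world &amp;amp; more' | decode_stream
# prints: hello < world &amp; more
```

Reading line by line keeps everything inside the shell, which is the point of the speed-up described above; for very large inputs an external tool may still be the better choice.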

Answer 6

Answered by Aissen

A Python 3.4+ version (html.unescape was added in Python 3.4):

cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'

Answer 7

Answered by unagi

Supporting the unescaping of all HTML entities with sed substitutions alone would require too long a list of commands to be practical, because every Unicode code point has at least two corresponding HTML entities (a decimal and a hexadecimal numeric reference).

But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):

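#!/bin/sh

# Convert decimal escapes (&#D;) into hexadecimal escapes (&#xH;).
htmlEscDec2Hex() {
    file=$1
    # No readable file argument: buffer standard input in a temporary file.
    [ ! -r "$file" ] && file=$(mktemp) && cat >"$file"

    # sed builds the printf format (escaping \ and %), grep supplies the numbers.
    printf -- \
        "$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
        $(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')

    [ x"$1" != x"$file" ] && rm -f -- "$file"
}

# Turn &#xH; escapes into \uHHHH / \UHHHHHHHH and let printf print them.
htmlHexUnescape() {
    printf -- "$(
        sed 's/\\/\\\\/g;s/%/%%/g
            ;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
            ;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
            ;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}

htmlEscDec2Hex "$1" | htmlHexUnescape \
    | sed -f named_entities.sed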
Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility's. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.

This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:

<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
  const referenceURL = 'https://html.spec.whatwg.org/entities.json';

  function writeln(element, text) {
    element.appendChild( document.createTextNode(text) );
    element.appendChild( document.createElement("br") );
  }

  (async function(container) {
    const json = await (await fetch(referenceURL)).json();
    container.innerHTML = "";
    writeln(container, "#!/usr/bin/sed -f");
    const addLast = [];
    for (const name in json) {
      // Escape the characters that are special in a sed replacement.
      const characters = json[name].characters
        .replace("\\", "\\\\")
        .replace("&", "\\&")
        .replace("/", "\\/");
      const command = "s/" + name + "/" + characters + "/g";
      if ( name.endsWith(";") ) {
        writeln(container, command);
      } else {
        addLast.push(command);
      }
    }
    for (const command of addLast) { writeln(container, command); }
  })( document.getElementById("sed-script") );
</script>
</body></html>

Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.

Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.

For example, if for some reason the data needs to be processed in chunks (this might be the case if printf is not a shell built-in and the data to process is large), one could use it as:

nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
    do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done

Explanation of the script follows.

There are three types of escape sequences that need to be supported:

&#D; where D is the decimal value of the escaped character's Unicode code point;
&#xH; where H is the hexadecimal value of the escaped character's Unicode code point;
&N; where N is the name of one of the named entities for the escaped character.

The &N; escapes are supported by the generated named_entities.sed script, which simply performs the list of substitutions.
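
To give an idea of what such a list of substitutions looks like, here is a hand-written three-rule subset. It is only an illustration, not the generated file: the function name decode_named_subset is hypothetical, and the real script would contain one rule per entity in the specification. Note that a literal & in a sed replacement has to be written as \&, because a bare & stands for the matched text.

```sh
# Hypothetical stand-in for named_entities.sed with only three rules.
decode_named_subset() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&amp;/\&/g'
}

printf '%s\n' 'hello &lt; world &amp; more' | decode_named_subset
# prints: hello < world & more
```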

The central piece of this method for supporting the code point escapes is the printf utility, which is able to:

print numbers in hexadecimal format, and
print characters from their code point's hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).
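
Both features can be tried in isolation. The sketch below is illustrative only: the helper name dec2hex_entity is made up, and the last command prints the expected character only with a printf that supports \u escapes running in a UTF-8 locale, so no output is claimed for it.

```sh
# Feature 1: format a decimal code point as a hexadecimal escape.
dec2hex_entity() {
    printf '&#x%x;\n' "$1"
}

dec2hex_entity 167      # prints &#xa7;
dec2hex_entity 128512   # prints &#x1f600;

# Feature 2: print the character for a hexadecimal code point.
# Requires \u support (e.g. GNU printf or the Bash built-in) and a UTF-8 locale.
printf '\u00a7\n'
```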

The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.

The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf's \u/\U escapes, then uses the second feature to print the unescaped characters.
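
The sed half of that transformation can be looked at on its own, without handing the result to printf. This is a sketch under the assumption that the substitutions work as described above: hexadecimal values are first padded with zeros, then values that fit in four digits become \uHHHH and the rest become \UHHHHHHHH. The function name hex_entity_to_printf_escape is hypothetical.

```sh
# Rewrite &#xH; escapes as printf-style \u / \U escape text (no decoding yet).
hex_entity_to_printf_escape() {
    sed 's/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
         s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
         s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g'
}

printf '%s\n' 'a &#xa7; b &#x1F600; c' | hex_entity_to_printf_escape
# prints: a \u00a7 b \U0001F600 c
```

Padding first means the two later rules only have to recognise fixed widths of four and eight digits, which is what \u and \U expect.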

Answer 8

Answered by Reino

With Xidel:

echo 'hello &lt; &#x3a; &quot; world' | xidel -s - -e 'parse-html($raw)'
hello < : " world
