Original question: http://stackoverflow.com/questions/5929492/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Bash script to convert from HTML entities to characters
Asked by Marko
I'm looking for a way to turn this:
hello &lt; world
to this:
hello < world
I could use sed, but how can this be accomplished without using cryptic regex?
Answer by ceving
Try recode (archived page; GitHub mirror; Debian page):
$ echo '&lt;' | recode html..ascii
<
Install on Debian, Ubuntu, and similar systems:
$ sudo apt-get install recode
Install on Mac OS using:
$ brew install recode
Answer by user1788934
With perl:
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
With php from the command line:
cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'
Answer by Whitecat
An alternative is to pipe through a web browser, such as:
echo '&#33;' | w3m -dump -T text/html
This worked great for me in cygwin, where downloading and installing distributions is difficult.
This answer was found here
Answer by user243
Using xmlstarlet:
echo 'hello &lt; world' | xmlstarlet unesc
Answer by WinEunuuchs2Unix
This answer is based on "Short way to escape HTML in Bash?", which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:
sed 's/&nbsp;/ /g; s/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&ldquo;/"/g; s/&rdquo;/"/g; s/&amp;/\&/g'
Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a bash script that web-scrapes SE answers and compares them to local code files here: Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers
Edit June 26, 2017
Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace, which brought the response time down to ~1 second.
Here's the function:
#-------------------------------------------------------------------------------
LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g;
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g;
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
    # Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
    # The & is quoted because Bash 5.2+ treats a bare & in the replacement
    # as "the matched text".
    LineOut="${LineOut//&amp;/"&"}"
} # HTMLtoText ()
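One detail worth noting in substitution-based decoders like the ones above is ordering: if &amp; is decoded before the other entities, text such as &amp;lt; is decoded twice and ends up as < instead of &lt;. The following is a minimal sketch under that assumption; unescape_basic is a hypothetical name, not part of the original answer, and it covers only a handful of common entities:

```shell
#!/bin/sh
# Hypothetical helper: decode a few common entities from standard input.
# &amp; is handled last so that already-escaped entity text is decoded once.
unescape_basic() {
    sed 's/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&amp;/\&/g'
}

printf '%s\n' 'a &amp;lt; b' | unescape_basic                  # prints: a &lt; b
printf '%s\n' '1 &lt; 2 &amp;&amp; 3 &gt; 2' | unescape_basic  # prints: 1 < 2 && 3 > 2
```

In the sed replacement, \& stands for a literal ampersand, because a bare & would insert the matched text.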
Answer by Aissen
A Python 3.4+ version (html.unescape was added in Python 3.4):
cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
Answer by unagi
Supporting the unescaping of all HTML entities using only sed substitutions would require too long a list of commands to be practical, because every Unicode code point has at least two corresponding HTML entities (its decimal and hexadecimal numeric forms).
But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):
#!/bin/sh

htmlEscDec2Hex() {
    file=$1
    [ ! -r "$file" ] && file=$(mktemp) && cat >"$file"

    printf -- \
        "$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
        $(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')

    [ x"$1" != x"$file" ] && rm -f -- "$file"
}

htmlHexUnescape() {
    printf -- "$(
        sed 's/\\/\\\\/g;s/%/%%/g
            ;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
            ;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
            ;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}

htmlEscDec2Hex "$1" | htmlHexUnescape \
    | sed -f named_entities.sed
Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility's. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.
This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:
<!DOCTYPE html>
<html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
  const referenceURL = 'https://html.spec.whatwg.org/entities.json';

  function writeln(element, text) {
    element.appendChild( document.createTextNode(text) );
    element.appendChild( document.createElement("br") );
  }

  (async function(container) {
    const json = await (await fetch(referenceURL)).json();
    container.innerHTML = "";
    writeln(container, "#!/usr/bin/sed -f");
    const addLast = [];
    for (const name in json) {
      // Escape characters that are special in a sed replacement:
      // backslash, the "/" delimiter, "&" (the matched text) and newline
      // (written as \n, which GNU sed accepts in a replacement).
      const characters = json[name].characters
        .replace(/\\/g, "\\\\")
        .replace(/\//g, "\\/")
        .replace(/&/g, "\\&")
        .replace(/\n/g, "\\n");
      const command = "s/" + name + "/" + characters + "/g";
      // Entities without a trailing ";" are emitted last so that they do
      // not shadow the longer, ";"-terminated names.
      if ( name.endsWith(";") ) {
        writeln(container, command);
      } else {
        addLast.push(command);
      }
    }
    for (const command of addLast) { writeln(container, command); }
  })( document.getElementById("sed-script") );
</script>
</body></html>
Simply open it in a modern browser, and save the resulting page as a text file named named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.
Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.
For example, if for some reason the data needs to be processed in chunks (this might be the case if printf is not a shell built-in and the data to process is large), one could use it as:
nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done
Explanation of the script follows.
There are three types of escape sequences that need to be supported:
&#D; where D is the decimal value of the escaped character's Unicode code point;
&#xH; where H is the hexadecimal value of the escaped character's Unicode code point;
&N; where N is the name of one of the named entities for the escaped character.
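To make the three forms concrete, here is a small sketch that lists the escape sequences found in its input, labelled by type. The function name list_entities and the simplified regular expressions are assumptions made for illustration; they are not part of the original script:

```shell
#!/bin/sh
# Hypothetical helper: print every escape sequence read from standard input,
# one per line, prefixed with its type. The patterns are approximations.
list_entities() {
    input=$(cat)
    printf '%s\n' "$input" | grep -o '&#[0-9][0-9]*;' | sed 's/^/decimal: /'
    printf '%s\n' "$input" | grep -o '&#[xX][0-9a-fA-F][0-9a-fA-F]*;' | sed 's/^/hex: /'
    printf '%s\n' "$input" | grep -o '&[A-Za-z][A-Za-z0-9]*;' | sed 's/^/named: /'
}

printf '%s\n' 'caf&#233; &#xA7;1 &lt;tag&gt;' | list_entities
# prints:
# decimal: &#233;
# hex: &#xA7;
# named: &lt;
# named: &gt;
```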
The &N; escapes are supported by the generated named_entities.sed script, which simply performs the list of substitutions.
The central piece of this method for supporting the code point escapes is the printf utility, which is able to:
print numbers in hexadecimal format, and
print characters from their code point's hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).
The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.
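The decimal-to-hexadecimal step can be followed on a single line of input. The sketch below separates the two halves that htmlEscDec2Hex combines: sed turns the text into a printf format string, and grep extracts the decimal values passed as arguments. The variable names line, fmt and args are illustrative only:

```shell
#!/bin/sh
line='caf&#233; &#167;1'

# Build a printf format: escape backslashes and percent signs, then replace
# every decimal escape with a hexadecimal escape containing a %x placeholder.
fmt=$(printf '%s\n' "$line" | sed 's/\\/\\\\/g; s/%/%%/g; s/&#[0-9]\{1,10\};/\&#x%x;/g')

# Collect the decimal values in order of appearance, one per line.
args=$(printf '%s\n' "$line" | grep -o '&#[0-9]\{1,10\};' | tr -d '&#;')

printf '%s\n' "$fmt"     # prints: caf&#x%x; &#x%x;1
# $args is intentionally unquoted so that each number becomes one argument.
printf "$fmt\n" $args    # prints: caf&#xe9; &#xa7;1
```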
The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf's \u/\U escapes, then uses the second feature to print the unescaped characters.
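The sed half of that transformation can be inspected on its own, without depending on a printf that understands \u. This sketch, with the hypothetical function name hex_entity_to_printf_escape, applies the same three substitutions: pad every hexadecimal escape with zeros, then keep the last four digits as \uHHHH when everything before them is zero, and otherwise the last eight digits as \UHHHHHHHH:

```shell
#!/bin/sh
# Hypothetical helper: rewrite &#xH; escapes on standard input as the
# \uHHHH or \UHHHHHHHH sequences that a capable printf would then expand.
hex_entity_to_printf_escape() {
    sed 's/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
         s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
         s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g'
}

printf '%s\n' '&#xA7; &#x1F600;' | hex_entity_to_printf_escape
# prints: \u00A7 \U0001F600
```

Passing that output as the format string of a printf supporting these sequences would then produce the characters themselves.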