Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/7381974/

Date: 2020-08-29 10:34:13  Source: igfitidea

Which characters need to be escaped in HTML?

Tags: html, html-entities, html-encode, html-escape-characters

Asked by Ahmet

Are they the same as XML, perhaps plus the space one (&nbsp;)?

I've found some huge lists of HTML escape characters, but I don't think they all must be escaped. I want to know what needs to be escaped.

Answered by Facebook Staff are Complicit

If you're inserting text content in your document in a location where text content is expected [1], you typically only need to escape the same characters as you would in XML. Inside of an element, this just includes the entity escape ampersand & and the element delimiter less-than and greater-than signs < >:

& becomes &amp;
< becomes &lt;
> becomes &gt;

Inside of attribute values you must also escape the quote character you're using:

" becomes &quot;
' becomes &#39;

In some cases it may be safe to skip escaping some of these characters, but I encourage you to escape all five in all cases to reduce the chance of making a mistake.
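
As a rough illustration of the "escape all five" advice, here is a minimal Python sketch. The helper name `escape_html` and the `_ESCAPES` table are made up for this example; only `html.escape` comes from the standard library.

```python
import html

# Hypothetical lookup table for the five characters listed above.
_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
}


def escape_html(text: str) -> str:
    """Escape text for element content or a quoted attribute value."""
    # Mapping character by character in a single pass means the "&" of an
    # already produced "&lt;" is never escaped a second time.
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


print(escape_html('<a href="x">Tom & Jerry\'s</a>'))

# The standard library's html.escape handles the same five characters,
# although it writes the apostrophe as &#x27; rather than &#39;.
print(html.escape("'"))
```

In practice you would probably just call `html.escape`, or rely on a template engine that escapes by default; the hand-written version is only there to make the five substitutions visible.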

If your document encoding does not support all of the characters that you're using, such as if you're trying to use emoji in an ASCII-encoded document, you also need to escape those. Most documents these days are encoded using the fully Unicode-supporting UTF-8 encoding where this won't be necessary.
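
To make the encoding point concrete, here is a small Python sketch. It assumes you are producing the document bytes yourself; the sample string is just an illustration.

```python
text = "caf\u00e9 \U0001F600"  # "café" followed by an emoji

# The "xmlcharrefreplace" error handler swaps every character the target
# encoding cannot represent for a decimal numeric character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes.decode("ascii"))  # caf&#233; &#128512;

# UTF-8 can represent both characters, so nothing needs escaping here.
utf8_bytes = text.encode("utf-8")
```

Note that this only deals with characters the encoding cannot represent. The &, <, > and quote characters still need the escaping described earlier.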

In general, you should not escape spaces as &nbsp;. &nbsp; is not a normal space, it's a non-breaking space. You can use these instead of normal spaces to prevent a line break from being inserted between two words, or to insert extra space without it being automatically collapsed, but this is usually a rare case. Don't do this unless you have a design constraint that requires it.

[1] By "a location where text content is expected", I mean inside of an element or quoted attribute value where normal parsing rules apply. For example: <p>HERE</p> or <p title="HERE">...</p>. What I wrote above does not apply to content that has special parsing rules or meaning, such as inside of a script or style tag, or as an element or attribute name. For example: <NOT-HERE>...</NOT-HERE>, <script>NOT-HERE</script>, <style>NOT-HERE</style>, or <p NOT-HERE="...">...</p>.

In these contexts, the rules are more complicated and it's much easier to introduce a security vulnerability. I strongly discourage you from ever inserting dynamic content in any of these locations. I have seen teams of competent, security-aware developers introduce vulnerabilities by assuming that they had encoded these values correctly, but missing an edge case. There's usually a safer alternative, such as putting the dynamic value in an attribute and then handling it with JavaScript.
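
As a rough illustration of the "put it in an attribute" alternative, here is a minimal Python sketch of the server side. The function name `render_data_attribute`, the `id="app"` element and the `data-config` attribute are all made up for this example.

```python
import html
import json


def render_data_attribute(value: object) -> str:
    """Return a <div> carrying `value` as JSON in a data attribute."""
    payload = json.dumps(value)
    # quote=True (the default) also escapes " and ', so the payload
    # cannot close the double-quoted attribute early.
    escaped = html.escape(payload, quote=True)
    return f'<div id="app" data-config="{escaped}"></div>'


markup = render_data_attribute({"name": 'Tom "</script>" & Jerry'})
print(markup)
```

On the page, a script could then read the value with something like `JSON.parse(document.getElementById("app").dataset.config)`, so the dynamic data never has to be written inside the script element itself.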

If you must, please read the Open Web Application Security Project's XSS Prevention Rules to help understand some of the concerns you will need to keep in mind.

Answered by daxelrod

It depends upon the context. Some possible contexts in HTML:

  • document body
  • inside common attributes
  • inside script tags
  • inside style tags
  • several more!

See OWASP's Cross Site Scripting Prevention Cheat Sheet, especially the "Why Can't I Just HTML Entity Encode Untrusted Data?" and "XSS Prevention Rules" sections. However, it's best to read the whole document.

Answered by Alireza

Basically, there are three main characters which should always be escaped in your HTML and XML files so that they don't interact with the rest of the markup. As you would probably expect, two of them are the syntax wrappers, < and >. They are listed below:

 1)  &lt; (<)

 2)  &gt; (>)

 3)  &amp; (&)

We may also write the double quote (") as &quot; and the single quote (') as &apos;

Avoid putting dynamic content in <script> and <style>. These rules do not apply to them. For example, if you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialisation.
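
The post-serialisation replacements described above might look roughly like this in Python. The function name `json_for_script` is hypothetical, and this sketch writes `\u003c` where the answer says `\x3c`: that substitution is my own choice, made so the output stays valid JSON as well as valid JavaScript.

```python
import json


def json_for_script(value: object) -> str:
    """Serialise `value` as JSON that is safer to embed inside <script>."""
    # ensure_ascii=False keeps non-ASCII characters literal, so the
    # U+2028 / U+2029 replacements below are actually exercised.
    text = json.dumps(value, ensure_ascii=False)
    # Replacements happen after serialisation, as the answer describes.
    return (
        text.replace("<", "\\u003c")
        .replace("\u2028", "\\u2028")
        .replace("\u2029", "\\u2029")
    )


print(json_for_script({"html": "</script><b>"}))
```

This is a sketch of the idea rather than a complete defence. The earlier advice to keep dynamic content out of script and style elements still applies.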

HTML Escape Characters: Complete List: http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php

So you need to escape <, and & when it is followed by anything that could begin a character reference. The rule on ampersands is also the only such rule for quoted attributes, since the matching quotation mark is the only thing that terminates one. If you don't want to terminate the attribute value there, escape the quotation mark.

Changing to UTF-8 means re-saving your file:

Using the character encoding UTF-8 for your page means that you can avoid the need for most escapes and just work with characters. Note, however, that to change the encoding of your document, it is not enough to just change the encoding declaration at the top of the page or on the server. You need to re-save your document in that encoding. For help understanding how to do that with your application read Setting encoding in web authoring applications.

Invisible or ambiguous characters:

A particularly useful role for escapes is to represent characters that are invisible or ambiguous in presentation.

One example would be Unicode character U+200F RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (e.g. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its numeric character reference equivalent, &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is U+00A0 NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; makes it quite clear where such spaces appear in the text.
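
If you want to apply this to existing text, one possible approach is a small Python helper like the one below. The names `VISIBLE_ESCAPES` and `reveal_invisible` are made up for this sketch, and it covers only the two characters mentioned above.

```python
# Map invisible or ambiguous characters to named character references.
VISIBLE_ESCAPES = {
    "\u200f": "&rlm;",   # U+200F RIGHT-TO-LEFT MARK
    "\u00a0": "&nbsp;",  # U+00A0 NO-BREAK SPACE
}


def reveal_invisible(text: str) -> str:
    """Replace the characters above with their visible escapes."""
    return text.translate(str.maketrans(VISIBLE_ESCAPES))


print(reveal_invisible("10\u00a0km"))  # 10&nbsp;km
```

The result is meant for HTML source that has already been escaped for &, < and >; the helper does not do that step itself.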

Answered by Andrey

The exact answer depends on the context. In general, these characters must not be present (HTML 5.2 §3.2.4.2.5):

Text nodes and attribute values must consist of Unicode characters, must not contain U+0000 characters, must not contain permanently undefined Unicode characters (noncharacters), and must not contain control characters other than space characters. This specification includes extra constraints on the exact value of Text nodes and attribute values depending on their precise context.

For elements in HTML, the constraints of the Text content model also depends on the kind of element. For instance, an "<" inside a textarea element does not need to be escaped in HTML because textarea is an escapable raw text element.

These restrictions are scattered across the specification. E.g., attribute values (§8.1.2.3) must not contain an ambiguous ampersand and must be either (i) empty, (ii) within single quotes (and thus must not contain the U+0027 APOSTROPHE character '), (iii) within double quotes (and thus must not contain the U+0022 QUOTATION MARK character "), or (iv) unquoted, with the following restrictions:

... must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
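
The unquoted-value restrictions quoted above can be turned into a quick check. This is only a sketch: the name `can_be_unquoted` is hypothetical, "space characters" is taken to mean space, tab, LF, FF and CR, and the separate ambiguous-ampersand rule is not checked.

```python
import re

# Characters the quoted rule forbids in an unquoted attribute value:
# space characters (space, tab, LF, FF, CR), " ' = < > and `.
_FORBIDDEN_UNQUOTED = re.compile(r"[ \t\n\f\r\"'=<>`]")


def can_be_unquoted(value: str) -> bool:
    """Return True if `value` meets the unquoted-attribute restrictions."""
    # The empty string is ruled out explicitly by the quoted text.
    return value != "" and _FORBIDDEN_UNQUOTED.search(value) is None


print(can_be_unquoted("main-nav"))   # True
print(can_be_unquoted("two words"))  # False
```

Quoting every attribute value and escaping the quote character sidesteps the need for a check like this.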
