HTML: Why is "&reg" being rendered as "®" without the semicolon?

Question

Asked by Spanky

I've been running into a problem that was revealed through our Google AdWords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "region" parameter is coming through incorrectly. What should be

http://ravercats.com/meow?foo=bar&region=catnip

is instead coming through as:

http://ravercats.com/meow?foo=bar®ion=catnip

I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:

&VALUE;

where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the ® entity, and it's wreaking all kinds of havoc throughout our system.

Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.

Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:

<html>
<a href="http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct">http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct</a>
</html>

EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".

Answer 1

Accepted answer by Alohci

Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognised by modern browsers' HTML parsers.

Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space), or otherwise always escape & as &amp; whenever in doubt.

For reference, the full list of named character references that are recognised without a semicolon is:

AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil, ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT, Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN, Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig, agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy, curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14, frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt, macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf, ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg, sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc, ugrave, uml, uuml, yacute, yen, yuml
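
The list above can be cross-checked from code. The following is a minimal sketch, assuming Python's standard library: its html.entities.html5 table stores the names that are recognised without a semicolon as keys that lack the trailing ";", and html.unescape applies the text-content behaviour (it does not implement the attribute-value exception). The variable names legacy_names and decoded_as_text are illustrative only.

```python
import html
from html.entities import html5

# In this table, names usable without a trailing semicolon appear as keys
# without ";" (for example "reg"), next to the strict form ("reg;").
legacy_names = sorted(name for name in html5 if not name.endswith(";"))

print("reg" in legacy_names)    # "&reg" is recognised without a semicolon
print("trade" in legacy_names)  # "&trade" is only recognised as "&trade;"

# The test URL from the question, decoded the way text content is decoded.
url = ("http://foo.com/bar?foo=bar&region=US&register=lowpass"
       "&reg_test=fail&trademark=correct")
decoded_as_text = html.unescape(url)
print(decoded_as_text)
```

Every parameter that starts with "reg" is affected, while "trademark" survives because "trade" is not on the list.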

However, it should be noted that, in an attribute value only, named character references in the above list are not processed as such by conforming HTML5 parsers if the next character is a = or an alphanumeric ASCII character.

For the full list of named character references with or without ending semicolons, see the named character references table in the HTML specification.

Answer 2

Answer by Jukka K. Korpela

This is a very messy business and depends on context (text content vs. attribute value).

Formally, by HTML specs up to and including HTML 4.01, an entity reference may appear without a trailing semicolon if the next character is not a name character. So e.g. &region= would be syntactically correct but undefined, as entity region has not been defined. XHTML makes the trailing semicolon required.

Browsers have traditionally played by other rules, though. Due to the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as just text data. And authors mostly used such constructs, even though they are formally incorrect.

Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but inside text content, which is rather uncommon: we don't normally write URLs in text. In text, &region= gets processed so that &reg is recognized as an entity reference (for “®”) and the rest is just character data. Such odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the “double standard”:

If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or in the range ASCII digits, uppercase ASCII letters, or lowercase ASCII letters, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.

Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less &region=. (But &reg_test= is a different case, due to the underscore character.)

In text content, other rules apply. The construct &region= then causes a parse error (by HTML5 CR rules), but with well-defined error handling: &reg is recognized as a character reference.
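
The “double standard” described above can be sketched in a few lines. This is a simplified illustration, not a conforming tokenizer: the function decode and its in_attribute flag are hypothetical names, the lookup table is assumed to be Python's html.entities.html5, and numeric references and other edge cases of the specification are ignored.

```python
import re
from html.entities import html5

# Names that the table lists without a trailing semicolon (e.g. "reg").
LEGACY = {name for name in html5 if not name.endswith(";")}
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")

def decode(text: str, in_attribute: bool) -> str:
    """Decode named character references, roughly as a browser would."""
    def repl(m: re.Match) -> str:
        name, semi = m.group(1), m.group(2)
        if semi and name + ";" in html5:
            return html5[name + ";"]          # well-formed reference
        # No usable semicolon: look for the longest legacy prefix.
        for end in range(len(name), 1, -1):
            prefix = name[:end]
            if prefix in LEGACY:
                rest = name[end:] + semi
                following = rest[:1] or text[m.end():m.end() + 1]
                # Attribute exception: "=" or an ASCII alphanumeric after
                # the matched name means the text is left untouched.
                if in_attribute and (
                    following == "="
                    or (following.isascii() and following.isalnum())
                ):
                    return m.group(0)
                return html5[prefix] + rest
        return m.group(0)
    return _REF.sub(repl, text)

query = "foo=bar&region=catnip"
print(decode(query, in_attribute=False))   # text content: &reg is decoded
print(decode(query, in_attribute=True))    # attribute value: left alone
print(decode("a&reg_test=fail", in_attribute=True))  # underscore case
```

The third call shows why reg_test is a different case: the character after &reg is an underscore, which is neither "=" nor alphanumeric, so the reference is decoded even inside an attribute value.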

Answer 3

Answer by jchapa

Maybe try replacing your & with &amp;? Ampersands are characters that must be escaped in HTML as well, because they are reserved to be used as parts of entities.

Answer 4

Answer by Salman A

1: The following markup is invalid in the first place (use the W3C Markup Validation Service to verify):

<a href="http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct"></a>

In the above example, the & character should be encoded as &amp;, like so:

<a href="http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct"></a>

2: Browsers are tolerant; they try to make sense out of broken HTML. In your case, every sequence that could be a valid HTML entity is treated as one.

Answer 5

Answer by Frank Tudor

Here is a simple workaround, though it may not work in all instances: reorder the query string so that region comes first and is no longer preceded by an ampersand.

So from this:

http://ravercats.com/meow?status=Online&region=Atlantis

To this:

http://ravercats.com/meow?region=Atlantis&status=Online

Because &reg, as we know, triggers the special character ®.

Caveat: If you have no control over the order of your URL query string parameters then you'll have to change your variable name to something else.
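
If the reordering is done in code rather than by hand, it might look like the sketch below. It assumes Python's urllib.parse; the helper move_param_first is a hypothetical name, and the approach only helps for a single known parameter, since any other parameter whose name starts with a listed entity name would still need escaping.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def move_param_first(url: str, name: str) -> str:
    """Return url with every `name` parameter moved to the front of the query."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    first = [pair for pair in pairs if pair[0] == name]
    rest = [pair for pair in pairs if pair[0] != name]
    return urlunsplit(parts._replace(query=urlencode(first + rest)))

reordered = move_param_first(
    "http://ravercats.com/meow?status=Online&region=Atlantis", "region"
)
print(reordered)
```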

Answer 6

Answer by jjyepez

It seems to me that what you have received from Google is not an actual URL but a variable which refers to a URL (query string). So, that's why it's being parsed as the registered mark (®) when rendered.

I would say you ought to URL-encode it and decode it whenever you process it, like any other variable containing special entities.

Answer 7

Answer by Kzqai

Escape your output!

Simply enough, you need to encode the URL into HTML format for accurate representation (ideally you would do so with a template engine's variable-escaping function, but barring that, with htmlspecialchars($url) or htmlentities($url) in PHP).
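
For readers not using PHP, the same idea can be sketched in Python, assuming the standard library's html.escape as a rough counterpart of htmlspecialchars. The variable names are illustrative, and the URL is the test case from the question.

```python
import html

url = ("http://foo.com/bar?foo=bar&region=US&register=lowpass"
       "&reg_test=fail&trademark=correct")

# Escape once, at output time, for both the attribute value and the link text.
escaped = html.escape(url, quote=True)
link = f'<a href="{escaped}">{escaped}</a>'
print(link)

# Decoding the escaped form gives back the original URL unchanged.
print(html.unescape(escaped) == url)
```

Because every & is written as &amp;, no parser has to guess whether &reg was meant as a character reference.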

See your test case and then the correctly encoded html at this jsfiddle: http://jsfiddle.net/tchalvakspam/Fp3W6/

Inactive code here:

<div>
Unescaped:
<br>
<a href="">http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct</a>
</div>

<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct
</div>

Answer 8

Answer by user2044453

To prevent this from happening you should URL-encode your URLs, which replaces characters like the ampersand with a % followed by a hexadecimal number.
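
As a rough illustration of that percent-encoding, assuming Python's urllib.parse: this suits a URL that is stored or passed along as a single value (such as a captured referrer), because encoding the separators of a live link would change what the link means. The variable names are illustrative.

```python
from urllib.parse import quote, unquote

landing = "http://ravercats.com/meow?foo=bar&region=catnip"

# safe="" also encodes "/", so every reserved character is replaced.
encoded = quote(landing, safe="")
print(encoded)

# With no "&" left, an HTML parser has nothing to mistake for "&reg".
print("&" in encoded)

# Decode again before processing the value.
print(unquote(encoded) == landing)
```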
