SQL - Remove all HTML tags in a string
Source: http://stackoverflow.com/questions/38868364/
This page reproduces a Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by dkmann
In my dataset, I have a field which stores text marked up with HTML. The general format is as follows:
<html><head></head><body><p>My text.</p></body></html>
I could attempt to solve the problem by doing the following:
REPLACE(REPLACE(Table.HtmlData, '<html><head></head><body><p>', ''), '</p></body></html>', '')
However, this is not a strict rule, as some entries break W3C standards and do not include <head> tags, for example. Even worse, closing tags could be missing. So I would need to include a REPLACE call for each opening and closing tag that could exist.
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
Table.HtmlData,
'<html>', ''),
'</html>', ''),
'<head>', ''),
'</head>', ''),
'<body>', ''),
'</body>', ''),
'<p>', ''),
'</p>', '')
I was wondering if there is a better way to accomplish this than using multiple nested REPLACE functions. Unfortunately, the only languages I have available in this environment are SQL and Visual Basic (not .NET).
Accepted answer by Devart
DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'
SELECT t.c.value('.', 'NVARCHAR(MAX)')
FROM @x.nodes('*') t(c)
Update - For strings with unclosed tags:
DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'
SELECT x.value('.', 'NVARCHAR(MAX)')
FROM (
SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)
) r
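The double REPLACE works because it turns every tag, opening or closing, into an empty self-closing element, so tag balance no longer matters to the XML parser. The following is a minimal sketch of applying the same idea to a column; the table variable @t and its column names are placeholders invented for illustration, not part of the original answer. It assumes the input contains no tags that are already self-closed (such as <br/>, which would become <br//>) and no bare & or HTML entities such as &nbsp;, since any of these would make the cast fail.

```sql
-- Hypothetical stand-in for the real table; all names here are illustrative.
DECLARE @t TABLE (Id INT, HtmlData NVARCHAR(MAX));
INSERT INTO @t VALUES
    (1, N'<html><head></head><body><p>My text.</p></body></html>'),
    (2, N'<html><body><p>My text.<br>More text.</p></html>');  -- unclosed <br>, missing </body>

-- Step 1: '>'  -> '/>' turns <p> into <p/> and </p> into </p/>.
-- Step 2: '</' -> '<'  turns </p/> into <p/>.
-- Every tag is now an empty element, so only the text nodes carry content.
SELECT t.Id,
       x.doc.value('.', 'NVARCHAR(MAX)') AS PlainText
FROM @t AS t
CROSS APPLY (
    SELECT CAST(REPLACE(REPLACE(t.HtmlData, N'>', N'/>'), N'</', N'<') AS XML)
) AS x(doc);
```

CROSS APPLY is used here only to give the converted XML value a name so that value() can be called on it once per row.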
Answer by LukStorms
If the HTML is well formed, there is no need to use REPLACE before parsing it as XML. Just cast or convert it to the XML type and get the value(s).
Here's an example that outputs the text from all tags:
declare @htmlData nvarchar(100) = '<html>
<head>
</head>
<body>
<p>My text.</p>
<p>My other text.</p>
</body>
</html>';
select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');
select cast(@htmlData as XML).value('.', 'nvarchar(max)');
Note that there's a difference in the output of whitespace between cast and convert.
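The whitespace difference comes from the conversion style: CAST behaves like CONVERT with style 0, which discards insignificant whitespace, while style 1 preserves it. A small sketch to observe this, using a made-up variable @h and DATALENGTH rather than LEN so that trailing spaces are counted:

```sql
-- @h is an illustrative value; the spaces between elements are "insignificant" whitespace.
DECLARE @h NVARCHAR(100) = N'<body> <p>My text.</p> <p>My other text.</p> </body>';

SELECT
    -- CAST (same as CONVERT style 0): whitespace-only text between tags is dropped
    DATALENGTH(CAST(@h AS XML).value('.', 'NVARCHAR(MAX)')) / 2     AS CastLength,
    -- CONVERT style 1: whitespace-only text between tags is kept
    DATALENGTH(CONVERT(XML, @h, 1).value('.', 'NVARCHAR(MAX)')) / 2 AS ConvertLength;
```

If the extracted text should keep line breaks or spacing that sat between tags, style 1 is the one to try; ConvertLength should come out larger than CastLength here.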
To get content from a specific node only, use XQuery syntax (XQuery is based on XPath syntax).
For example:
select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');
select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');
Result : My text.
Of course, this still assumes valid XML. If, for example, a closing tag is missing, this raises an XML parsing error.
If the HTML isn't well formed as XML, one could use PATINDEX and SUBSTRING to extract the first p element, and then cast that to the XML type to get the value.
select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');
Or via a funky recursive approach that splits the string on line breaks and keeps only the lines containing a p element:
declare @xmlData nvarchar(100);
WITH Lines(n, x, y) AS (
SELECT 1, 1, CHARINDEX(char(13), @htmlData)
UNION ALL
SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines
WHERE y > 0
)
SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))
FROM Lines
where PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0
order by n;
select
@xmlData as xmlData,
convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;
Answer by cp50
First, create a user-defined function that strips the HTML out, like so:
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @Start INT;
    DECLARE @End INT;
    DECLARE @Length INT;
    -- Locate the first tag: from the first '<' to the next '>' after it
    SET @Start = CHARINDEX('<', @HTMLText);
    SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
    SET @Length = (@End - @Start) + 1;
    WHILE @Start > 0
        AND @End > 0
        AND @Length > 0
    BEGIN
        -- Cut the tag out, then look for the next one
        SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '');
        SET @Start = CHARINDEX('<', @HTMLText);
        SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
        SET @Length = (@End - @Start) + 1;
    END;
    RETURN LTRIM(RTRIM(@HTMLText));
END;
GO
Then call it in a SELECT:
SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE
This avoids having to use several nested REPLACE calls.
Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/
Answer by Shnugo
One more solution, just to demonstrate a trick for replacing many values listed in a table (easy to maintain!) in one single statement:
--add any replace templates here:
CREATE TABLE ReplaceTags (HTML VARCHAR(100));
INSERT INTO ReplaceTags VALUES
('<html>'),('<head>'),('<body>'),('<p>'),('<br>')
,('</html>'),('</head>'),('</body>'),('</p>'),('</br>');
GO
--This function will perform the "trick"
CREATE FUNCTION dbo.DoReplace(@Content VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
SELECT @Content=REPLACE(@Content,HTML,'')
FROM ReplaceTags;
RETURN @Content;
END
GO
--All examples I found in your question and in comments
DECLARE @content TABLE(Content VARCHAR(MAX));
INSERT INTO @content VALUES
('<html><head></head><body><p>My text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.</p></html>');
--this is the actual query
SELECT dbo.DoReplace(Content) FROM @content;
GO
--Clean-Up
DROP FUNCTION dbo.DoReplace;
DROP TABLE ReplaceTags;
UPDATE
If you add a replacement value to the template table, you can even use different replacements per tag, such as replacing a <br> with an actual line break.
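A minimal sketch of that variation follows. The Replacement column and the names ReplaceTagsWithValue and DoReplaceWithValue are assumptions made up for this illustration, not part of the original answer. As in the original function, it relies on SQL Server applying the variable assignment once per row of the template table, which works in practice for this kind of use but is not behavior the documentation guarantees.

```sql
-- Assumed schema: each tag is paired with the text that should replace it.
CREATE TABLE dbo.ReplaceTagsWithValue (HTML VARCHAR(100), Replacement VARCHAR(100));
INSERT INTO dbo.ReplaceTagsWithValue VALUES
 ('<html>',''),('</html>',''),('<head>',''),('</head>','')
,('<body>',''),('</body>',''),('<p>',''),('</p>','')
,('<br>', CHAR(13) + CHAR(10));   -- <br> becomes a real CR/LF line break
GO
CREATE FUNCTION dbo.DoReplaceWithValue(@Content VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- One REPLACE per template row, each using that row's own replacement text
    SELECT @Content = REPLACE(@Content, HTML, Replacement)
    FROM dbo.ReplaceTagsWithValue;
    RETURN @Content;
END
GO
SELECT dbo.DoReplaceWithValue('<html><head></head><body><p>My text.<br>More text.</p></body></html>');
GO
--Clean-Up
DROP FUNCTION dbo.DoReplaceWithValue;
DROP TABLE dbo.ReplaceTagsWithValue;
```

Keeping the replacement next to the tag means a new rule, for example mapping </p> to a line break as well, is a single row change rather than a code change.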
Answer by StackUser
This is the simplest way.
DECLARE @str VARCHAR(299)
SELECT @str = '<html><head></head><body><p>My text.</p></body></html>'
SELECT cast(@str AS XML).query('.').value('.', 'varchar(200)')
Answer by Ranjana Ghimire
This is just an example. You can use this in a script to remove any HTML tags:
DECLARE @VALUE VARCHAR(MAX), @start INT, @end INT
SET @VALUE='<html itemscope itemtype="http://schema.org/QAPage">
<head>
<title>sql - Converting INT to DATE then using GETDATE on conversion? - Stack Overflow</title>
<html>
</html>
'
SET @start = CHARINDEX('<', @VALUE)
-- look for the closing '>' only after the current '<'
SET @end = CHARINDEX('>', @VALUE, @start)
WHILE @start > 0 AND @end > 0
BEGIN
    -- STUFF takes a length, so remove exactly the characters from '<' to '>'
    SET @VALUE = STUFF(@VALUE, @start, @end - @start + 1, '')
    SET @start = CHARINDEX('<', @VALUE)
    SET @end = CHARINDEX('>', @VALUE, @start)
END
PRINT @VALUE
Answer by Zsuzsa
You mention the XML is not always valid, but does it always contain the <p> and </p> tags?
In that case the following would work:
SUBSTRING(Table.HtmlData,
CHARINDEX('<p>', Table.HtmlData) + 3,
CHARINDEX('</p>', Table.HtmlData) - CHARINDEX('<p>', Table.HtmlData) - 3)
For finding all positions of a <p> within an HTML string, there's already a good post here: https://dba.stackexchange.com/questions/41961/how-to-find-all-positions-of-a-string-within-another-string
Alternatively, I suggest using Visual Basic, since you mentioned that it is also an option.