SQL - Remove all HTML tags from a string

Question

Asked by dkmann

In my dataset, I have a field which stores text marked up with HTML. The general format is as follows:

<html><head></head><body><p>My text.</p></body></html>

I could attempt to solve the problem by doing the following:

REPLACE(REPLACE(Table.HtmlData, '<html><head></head><body><p>', ''), '</p></body></html>', '')

However, this is not a strict rule, as some entries break W3C standards and do not include <head> tags, for example. Even worse, closing tags could be missing. So I would need to include a REPLACE function for each opening and closing tag that could exist.

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    Table.HtmlData,
    '<html>', ''),
    '</html>', ''),
    '<head>', ''),
    '</head>', ''),
    '<body>', ''),
    '</body>', ''),
    '<p>', ''),
    '</p>', '')

I was wondering if there is a better way to accomplish this than using multiple nested REPLACE functions. Unfortunately, the only languages I have available in this environment are SQL and Visual Basic (not .NET).

Answer 1

Accepted answer by Devart

DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'

SELECT t.c.value('.', 'NVARCHAR(MAX)')
FROM @x.nodes('*') t(c)

Update - For strings with unclosed tags:

DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'

SELECT x.value('.', 'NVARCHAR(MAX)')
FROM (
    SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)
) r
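
To see why the two REPLACE calls in the update yield something the XML parser accepts, it can help to look at the intermediate strings. The following is only a sketch on a shortened sample; the variable names @html, @step1 and @step2 are illustrative and not part of the answer.

```sql
-- Shortened sample with an unclosed <br> tag (illustrative input).
DECLARE @html NVARCHAR(MAX) = N'<p>My text.<br>More text.</p>';

-- Step 1: every '>' becomes '/>', so every tag now ends like a self-closing tag.
-- @step1 = '<p/>My text.<br/>More text.</p/>'
DECLARE @step1 NVARCHAR(MAX) = REPLACE(@html, N'>', N'/>');

-- Step 2: every '</' becomes '<', so former closing tags turn into empty elements too.
-- @step2 = '<p/>My text.<br/>More text.<p/>'
DECLARE @step2 NVARCHAR(MAX) = REPLACE(@step1, N'</', N'<');

-- All tags are now empty elements, and the text sits between them as plain text nodes.
SELECT CAST(@step2 AS XML).value('.', 'NVARCHAR(MAX)') AS PlainText;
```

Because the trick is purely textual, it presumably breaks on input that already contains self-closing tags (for example, '<br/>' would become '<br//>') or HTML entities such as &nbsp; that XML does not predefine.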

Answer 2

Answered by LukStorms

If the HTML is well formed then there's no need to use replace to parse XML.
Just cast or convert it to an XML type and get the value(s).

Here's an example to output the text from all tags:

declare @htmlData nvarchar(100) = '<html>
<head>
</head>
<body>
   <p>My text.</p>
   <p>My other text.</p>
</body>
</html>';

select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');

select cast(@htmlData as XML).value('.', 'nvarchar(max)');

Note that cast and convert differ in how they output whitespace: convert with style 1 preserves insignificant whitespace, whereas cast discards it.

To get content from only a specific node, XQuery syntax is used. (XQuery is based on the XPath syntax.)

For example:

select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');

select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');

Result: My text.

Of course, this still assumes valid XML.
If, for example, a closing tag is missing, this raises an XML parsing error.
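
If a query should keep running when some rows fail to parse, one option is to replace the cast with TRY_CAST, which returns NULL instead of raising an error. This is a sketch only, assuming SQL Server 2012 or later; the table variable @samples and its rows are hypothetical.

```sql
-- Hypothetical sample rows; the second one is missing its </p> tag.
DECLARE @samples TABLE (HtmlData NVARCHAR(MAX));
INSERT INTO @samples VALUES
 (N'<html><body><p>My text.</p></body></html>'),
 (N'<html><body><p>My text.</body></html>');

-- TRY_CAST yields NULL where the string is not well-formed XML,
-- so PlainText is NULL for that row instead of the whole query failing.
SELECT s.HtmlData,
       x.Doc.value('.', 'NVARCHAR(MAX)') AS PlainText
FROM @samples s
CROSS APPLY (SELECT TRY_CAST(s.HtmlData AS XML) AS Doc) x;
```

Rows that come back as NULL can then be handled separately, for instance with one of the string-based approaches.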

If the HTML isn't well-formed XML, one could use PATINDEX and SUBSTRING to extract the first p tag, and then cast that to an XML type to get the value.

select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');

or via a funky recursive way:

declare @xmlData nvarchar(100);
WITH Lines(n, x, y) AS (
  SELECT 1, 1, CHARINDEX(char(13), @htmlData)
  UNION ALL
  SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines
  WHERE y > 0
)
SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))
FROM Lines
where PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0
order by n;

select 
@xmlData as xmlData, 
convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;

Answer 3

Answered by cp50

First, create a user-defined function that strips out the HTML, like so:

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
     BEGIN
         DECLARE @Start INT;
         DECLARE @End INT;
         DECLARE @Length INT;
         SET @Start = CHARINDEX('<', @HTMLText);
         SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
         SET @Length = (@End - @Start) + 1;
         WHILE @Start > 0
               AND @End > 0
               AND @Length > 0
             BEGIN
                 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '');
                 SET @Start = CHARINDEX('<', @HTMLText);
                 SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));
                 SET @Length = (@End - @Start) + 1;
             END;
         RETURN LTRIM(RTRIM(@HTMLText));
     END;
GO

Then call it in a query:

SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE

This lets you avoid several nested REPLACE calls.

Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

Answer 4

Answered by Shnugo

One more solution, just to demonstrate a trick for replacing many values from a table (easy to maintain!) in a single statement:

--add any replace templates here:

CREATE TABLE ReplaceTags (HTML VARCHAR(100));
INSERT INTO ReplaceTags VALUES
 ('<html>'),('<head>'),('<body>'),('<p>'),('<br>')
,('</html>'),('</head>'),('</body>'),('</p>'),('</br>');
GO

--This function will perform the "trick"

CREATE FUNCTION dbo.DoReplace(@Content VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    SELECT @Content=REPLACE(@Content,HTML,'')
    FROM ReplaceTags;

    RETURN @Content;
END
GO

--All examples I found in your question and in comments

DECLARE @content TABLE(Content VARCHAR(MAX));
INSERT INTO @content VALUES
 ('<html><head></head><body><p>My text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.<br>More text.</p></body></html>')
,('<html><head></head><body><p>My text.</p></html>');

--this is the actual query

SELECT dbo.DoReplace(Content) FROM @content;
GO

--Clean-Up

DROP FUNCTION dbo.DoReplace;
DROP TABLE ReplaceTags;

UPDATE

If you add a replacement value to the template table, you can even use different replacements per tag, for example replacing a <br> with an actual line break.
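
A possible shape for that variant is sketched below. The second column name, Replacement, is an assumption made for illustration, and the script is meant to run after the clean-up step, since it reuses the names ReplaceTags and dbo.DoReplace.

```sql
-- Template table with a per-tag replacement (column name "Replacement" is hypothetical).
CREATE TABLE ReplaceTags (HTML VARCHAR(100), Replacement VARCHAR(100));
INSERT INTO ReplaceTags VALUES
 ('<html>',''),('</html>',''),('<head>',''),('</head>','')
,('<body>',''),('</body>',''),('<p>',''),('</p>','')
,('<br>', CHAR(13) + CHAR(10));  -- turn <br> into a CR/LF line break
GO

CREATE FUNCTION dbo.DoReplace(@Content VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- Same row-by-row variable assignment as in the original function,
    -- but each row now supplies its own replacement text.
    SELECT @Content = REPLACE(@Content, HTML, Replacement)
    FROM ReplaceTags;

    RETURN @Content;
END
GO

SELECT dbo.DoReplace('<html><head></head><body><p>My text.<br>More text.</p></body></html>');
```

Keeping the replacements in a table means a new tag or a different substitution only needs an INSERT or UPDATE, not a change to the function.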

Answer 5

Answered by StackUser

This is the simplest way.

DECLARE @str VARCHAR(299)

SELECT @str = '<html><head></head><body><p>My text.</p></body></html>'

SELECT cast(@str AS XML).query('.').value('.', 'varchar(200)')

Answer 6

Answered by Ranjana Ghimire

This is just an example. You can use this in a script to remove any HTML tags:

 DECLARE @VALUE VARCHAR(MAX),@start INT,@end int,@remove varchar(max)
SET @VALUE='<html itemscope itemtype="http://schema.org/QAPage">
<head>

<title>sql - Converting INT to DATE then using GETDATE on conversion? - Stack Overflow</title>
<html>
</html>
'

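set @start=charindex('<',@value)
while @start>0
begin
-- look for the closing '>' only after the opening '<'
set @end=charindex('>',@value,@start)
-- stop if a tag is never closed, to avoid an endless loop
if @end=0 break
-- the third argument of substring is a length, not an end position
set @remove=substring(@value,@start,@end-@start+1)
set @value=replace(@value,@remove,'')
set @start=charindex('<',@value)
end
print @value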

Answer 7

Answered by Zsuzsa

You mention the XML is not always valid, but does it always contain the <p> and </p> tags?

In that case the following would work:

SUBSTRING(Table.HtmlData, 
    CHARINDEX('<p>', Table.HtmlData) + 3, 
    CHARINDEX('</p>', Table.HtmlData) - CHARINDEX('<p>', Table.HtmlData) - 3)

For finding all positions of a tag within an HTML string, there's already a good post here: https://dba.stackexchange.com/questions/41961/how-to-find-all-positions-of-a-string-within-another-string

Alternatively, I suggest using Visual Basic, since you mentioned that it is also an option.
