How can you programmatically (or with a tool) convert .MHT (MHTML) files to regular HTML and CSS files?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/16203002/

Time: 2020-08-29 07:50:44  Source: igfitidea

How can you programmatically (or with a tool) convert .MHT mhtml files to regular HTML and CSS files?

Tags: html, converter, mhtml

Asked by klumsy

Many tools can export a .MHT file. I want a way to convert that single file into a collection of files (an HTML file, the relevant images, and CSS files) that I could then upload to a web host so that all browsers can consume it. Does anybody know of any tools, libraries, or algorithms to do this?

Accepted answer by XNargaHuntress

Well, you can open the .MHT file in IE and then save it as a web page. I tested this with this page, and even though it looked odd in IE (it's IE after all), it saved and then opened fine in Chrome (as in, it looked like it should).

Barring that method, looking at the file itself, text blocks are saved in the file as-is, and all other content is saved in Base64. Each item of content is preceded by:

[Boundary]
Content-Type: [Mime Type]
Content-Transfer-Encoding: [Encoding Type]
Content-Location: [Full path of content]

Where [Mime Type], [Encoding Type], and [Full path of content] are variable. [Encoding Type] appears to be either base64 or quoted-printable. [Boundary] is defined at the beginning of the .MHT file like so:

From: <Saved by WebKit>
Subject: converter - How can you programmatically (or with a tool) convert .MHT mhtml        files to regular HTML and CSS files? - Stack Overflow
Date: Fri, 9 May 2013 13:53:36 -0400
MIME-Version: 1.0
Content-Type: multipart/related;
    type="text/html";
    boundary="----=_NextPart_000_0C08_58653ABB.B67612B7"
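
Since these are ordinary MIME headers, they can be inspected without hand-written parsing. The following is a minimal sketch, assuming Python and its standard `email` package; the function name `describe_parts` is made up for illustration and is not part of any tool mentioned here.

```python
import email


def describe_parts(mht_bytes):
    """Return (boundary, [(content type, transfer encoding, Content-Location), ...])."""
    message = email.message_from_bytes(mht_bytes)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the multipart/related container itself
        parts.append((
            part.get_content_type(),
            str(part.get("Content-Transfer-Encoding", "")),
            str(part.get("Content-Location", "")),
        ))
    # get_boundary() reads the boundary parameter from the top-level Content-Type.
    return message.get_boundary(), parts
```

Calling `describe_parts(open("page.mht", "rb").read())` (with a hypothetical file name) should give the boundary string plus one tuple per embedded resource.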

Using that, you could make your own file parser if needed.
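
As one possible shape for such a parser, here is a hedged sketch in Python that leans on the standard `email` package to split the parts and decode base64 or quoted-printable bodies. The function name `extract_mht` and the file-naming scheme (last path segment of Content-Location) are assumptions for illustration; name collisions and rewriting of links inside the HTML are not handled.

```python
import email
import os
from urllib.parse import urlparse


def extract_mht(mht_bytes, out_dir):
    """Write each part of an MHT document into out_dir; return {Content-Location: path}."""
    message = email.message_from_bytes(mht_bytes)
    os.makedirs(out_dir, exist_ok=True)
    saved = {}
    for index, part in enumerate(message.walk()):
        if part.is_multipart():
            continue  # the container has no payload of its own
        location = str(part.get("Content-Location", ""))
        # Assumed naming scheme: last path segment of the URL, else a numbered fallback.
        name = os.path.basename(urlparse(location).path) or f"part{index}"
        if part.get_content_type() == "text/html" and not name.endswith((".html", ".htm")):
            name += ".html"
        path = os.path.join(out_dir, name)
        with open(path, "wb") as handle:
            # decode=True undoes the Content-Transfer-Encoding (base64 or quoted-printable).
            handle.write(part.get_payload(decode=True))
        saved[location] = path
    return saved
```

The returned mapping could then be used to rewrite the absolute URLs in the saved HTML so they point at the extracted files.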

Answer by sahwar

Besides IE and MS Word, there's an open-source cross-platform program called 'mht2html', first written in 2007 and last updated in 2016. It has both a GUI and a terminal interface.

I haven't tested it yet but it seems to have received good reviews.

Answer by Zagavarr

An MHT file is essentially MIME, so it's possible to use Chilkat.Mime or the completely free System.Net.Mime components to access its internal structure. If, for example, the MHT contains images, they can be replaced with base64 strings (data URIs) in the output HTML.

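Imports System.Collections 'Needed for Specialized.NameValueCollection'
Imports System.IO 'Needed for StringWriter'
Imports System.Linq 'Needed for ToArray'
Imports HtmlAgilityPack
Imports Fizzler.Systems.HtmlAgilityPack
Imports Chilkat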
Public Function ConvertMhtToHtml(ByVal mhtFile As String) As String
    Dim chilkatWholeMime As New Chilkat.Mime
    'Load mime'
    chilkatWholeMime.LoadMimeFile(mhtFile)
    'Get html string, which is 1-st part of mime'
    Dim html As String = chilkatWholeMime.GetPart(0).GetBodyDecoded
    'Create collection for storing url of images and theirs base64 representations'
    Dim allImages As New Specialized.NameValueCollection
    'Iterate through mime parts'
    For i = 1 To chilkatWholeMime.NumParts - 1
        Dim m As Chilkat.Mime = chilkatWholeMime.GetPart(i)
        'See if it is image'
        If m.IsImage AndAlso m.Encoding = "base64" Then
            allImages.Add(m.GetHeaderField("Content-Location"), "data:" + m.ContentType + ";base64," + m.GetBodyEncoded)
        End If : m.Dispose()
    Next : chilkatWholeMime.Dispose()
    'Now it is time to replace the source attribute of all images in HTML with dataURI'
    Dim htmlDoc As New HtmlDocument : htmlDoc.LoadHtml(html) : Dim docNode As HtmlNode = htmlDoc.DocumentNode
    For i = 0 To allImages.Count - 1
        'Select all images, whose src attribute is equal to saved URL'
        Dim keyURL As String = allImages.GetKey(i) 'Saved url from MHT'
        Dim elementsWithPics() As HtmlNode = docNode.QuerySelectorAll("img[src='" + keyURL + "']").ToArray
        Dim imgsrc As String = allImages.GetValues(i)(0) 'dataURI as base64 string'
        For j = 0 To elementsWithPics.Length - 1
            elementsWithPics(j).SetAttributeValue("src", imgsrc)
        Next
        'Select all elements, whose style attribute contains saved URL'
        'Use *= (substring match); ~= only matches whole whitespace-separated words'
        elementsWithPics = docNode.QuerySelectorAll("[style*='" + keyURL + "']").ToArray
        For j = 0 To elementsWithPics.Length - 1
            'Get and modify style'
            Dim modStyle As String = Strings.Replace(elementsWithPics(j).GetAttributeValue("style", String.Empty), keyURL, imgsrc, 1, 1, 1)
            elementsWithPics(j).SetAttributeValue("style", modStyle)
        Next : Erase elementsWithPics
    Next
    'Get final html'
    Dim tw As New StringWriter()
    htmlDoc.Save(tw) : html = tw.ToString : tw.Close() : tw.Dispose()
    Return html
End Function
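
For comparison, the same data-URI idea can be sketched without third-party components. This is a rough Python equivalent using only the standard library, not a port of the function above: the name `mht_to_single_html` is hypothetical, and it uses plain string replacement instead of an HTML parser, so it assumes the HTML refers to each image by its exact Content-Location URL.

```python
import base64
import email


def mht_to_single_html(mht_bytes):
    """Return the HTML part of an MHT document with its images inlined as data URIs."""
    message = email.message_from_bytes(mht_bytes)
    html = None
    data_uris = {}
    for part in message.walk():
        if part.is_multipart():
            continue
        body = part.get_payload(decode=True)  # raw bytes after transfer decoding
        if html is None and part.get_content_type() == "text/html":
            charset = part.get_content_charset() or "utf-8"
            html = body.decode(charset, errors="replace")
        elif part.get_content_maintype() == "image":
            location = str(part.get("Content-Location", ""))
            if location:
                encoded = base64.b64encode(body).decode("ascii")
                data_uris[location] = f"data:{part.get_content_type()};base64,{encoded}"
    if html is None:
        raise ValueError("no text/html part found")
    # Replace longer URLs first so one URL that is a prefix of another is not clobbered.
    for location in sorted(data_uris, key=len, reverse=True):
        html = html.replace(location, data_uris[location])
    return html
```

Because the replacement is textual, it also catches URLs inside inline style attributes, at the cost of possibly touching matching text elsewhere in the page.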

Answer by CaptainBli

I think that @XGundam05 is correct. Here is what I did to make it work.

I started with a Windows Forms project in Visual Studio. I added a WebBrowser control to the form and then added two buttons. Then this code:

    private void button1_Click(object sender, EventArgs e)
    {
        webBrowser1.ShowSaveAsDialog();
    }

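    private void button2_Click(object sender, EventArgs e)
    {
        // Uri(string) requires an absolute URI, so resolve the relative file name first.
        webBrowser1.Url = new Uri(System.IO.Path.GetFullPath("localfile.mht"));
    }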

You should be able to take this code, add a list of files, and process each one with a foreach. The webBrowser has a method called ShowSaveAsDialog(), which lets you save the page as .mht, as just the HTML, or as the complete page.

EDIT: At this point you could use the webBrowser's Document and scrape the information, by adding a richTextBox and a public variable as per MS here: http://msdn.microsoft.com/en-us/library/ms171713.aspx

    public string Code
    {
        get
        {
            if (richTextBox1.Text != null)
            {
                return (richTextBox1.Text);
            }
            else
            {
                return ("");
            }
        }
        set
        {
            richTextBox1.Text = value;
        }
    }


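    private void button2_Click(object sender, EventArgs e)
    {
        // Navigation is asynchronous, so read the document in DocumentCompleted
        // instead of immediately after setting Url.
        webBrowser1.DocumentCompleted -= webBrowser1_DocumentCompleted;
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
        // Uri(string) requires an absolute URI, so resolve the relative file name first.
        webBrowser1.Url = new Uri(System.IO.Path.GetFullPath("localfile.mht"));
    }

    private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        if (webBrowser1.Document == null)
        {
            return;
        }

        HtmlElementCollection elems = webBrowser1.Document.GetElementsByTagName("HTML");
        if (elems.Count == 1)
        {
            Code = elems[0].OuterHtml;
            // Iterate over the document's images rather than the single HTML element.
            foreach (HtmlElement img in webBrowser1.Document.Images)
            {
                //look for pictures to save
            }
        }
    }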

Answer by klumsy

So automating IE was difficult and not usable end to end, so I think writing code that does the conversion is the way to go. On GitHub I found this Python one, which may be good:

https://github.com/Modified/MHTifier
http://decodecode.net/elitist/2013/01/mhtifier/

If I have time, I'll try to do something similar in PowerShell.

Answer by kyb

Firefox has an embedded tool. Go to the menu (press Alt if it is hidden): File -> Convert saved pages.

Answer by Knight

Step 1: Open the .MHT / .MHTML file in a browser.

Step 2: Right-click and choose to view the source code.

Step 3: Copy the source code and paste it into a new .TXT file, then change the file extension to .HTML.