How do I convert PDF to HTML?

Question

Asked by Luchian Grigore

Is there a proper library which I can use to convert PDF to HTML or some other format that can be converted to HTML easily?

I searched similar questions, but had no luck.

I want to be able to extract text from PDFs, and possibly images. I'm not looking to embed the PDF inside the HTML.

Answer 1

Accepted answer by Siddharth Rout

As I mentioned in the comment above, it is definitely possible to convert PDF to HTML using the tool Able2Extract7, which can be downloaded from here.

I have been using this tool for almost 2 years now and I am pretty happy with it. It lets you convert PDF to Word, Excel, PowerPoint, Publisher, HTML, OO, etc.

Important note: This tool is not freeware.

HTH

Answer 2

Answered by moof2k

If you're on Linux, try pdftohtml:

sudo apt-get install poppler-utils
pdftohtml -enc UTF-8 -noframes infile.pdf outfile.html
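
If the conversion has to happen from inside a Java program rather than a terminal, the same command can be launched with `ProcessBuilder`. The following is only a minimal sketch: the class name `PdfToHtmlRunner` and its method names are made up for illustration, and it assumes the `pdftohtml` binary from poppler-utils is installed and on the PATH.

```java
import java.io.IOException;
import java.util.List;

public class PdfToHtmlRunner {

    // Builds the same argument list as the shell command above.
    public static List<String> buildCommand(String pdfPath, String htmlPath) {
        return List.of("pdftohtml", "-enc", "UTF-8", "-noframes", pdfPath, htmlPath);
    }

    // Runs pdftohtml and returns its exit code.
    // Assumption: the pdftohtml binary is installed and reachable via the PATH.
    public static int convert(String pdfPath, String htmlPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath, htmlPath))
                .inheritIO() // forward pdftohtml's messages to this process's console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: PdfToHtmlRunner <infile.pdf> <outfile.html>");
            return;
        }
        int exitCode = convert(args[0], args[1]);
        System.out.println("pdftohtml exited with code " + exitCode);
    }
}
```

Passing the arguments as a list, rather than one concatenated string, avoids quoting problems when file names contain spaces.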

The open-source ebook converter Calibre can also convert PDF files to HTML and is available on macOS, Windows, and Linux.

Answer 3

Answered by Sergio Muriel

Download

pdfbox-2.0.3.jar
fontbox-2.0.3.jar
preflight-2.0.3.jar
xmpbox-2.0.3.jar
pdfbox-tools-2.0.3.jar
pdfbox-debugger-2.0.3.jar

from http://pdfbox.apache.org/

 import java.io.FileInputStream;
 import java.io.InputStream;
 import java.io.IOException;
 import org.apache.pdfbox.pdmodel.PDDocument;
 import org.apache.pdfbox.tools.PDFText2HTML;

    // .....
    // "input.pdf" is a placeholder path; obtain the InputStream however suits your application.
    // try-with-resources closes the document and the stream even if the conversion throws.
    try (InputStream is = new FileInputStream("input.pdf");
         PDDocument pdd = PDDocument.load(is)) { // in-memory representation of the PDF document
        PDFText2HTML converter = new PDFText2HTML(); // the converter
        String html = converter.getText(pdd); // HTML markup for the document's text. That's it!
    } catch (IOException ioe) {
        // ......
    }

Please note: Images do not get pushed to the HTML output.

Answer 4

Answered by thomasb

It is technically impossible to simply "convert" a PDF file to HTML. The PDF format is more like a "canvas", where you "place" your text blocks and images, whereas HTML needs either CSS or a lot of tables to "place" the blocks. Moreover, PDF files embed the images, whereas HTML simply calls other files.
There are many other examples of differences, but essentially, it's like asking to convert an image or a video with text in it.
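
To make the "canvas" point concrete, here is a rough sketch of what a converter has to do with positioned text: emit one absolutely positioned element per block. Everything in it is hypothetical (the `CanvasToHtml` class, the `TextBlock` model, and the assumption that the extractor reports each block's top-left corner in points); it is not how any particular tool works.

```java
import java.util.List;
import java.util.Locale;

public class CanvasToHtml {

    // Hypothetical model of one text run: its string plus the position of its
    // top-left corner in PDF user space (points, origin at the bottom-left).
    public static final class TextBlock {
        final String text;
        final double x;
        final double y;

        public TextBlock(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Escapes the characters that are significant in HTML text content.
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits one absolutely positioned <div> per block inside a page-sized container.
    public static String toHtml(List<TextBlock> blocks, double pageWidth, double pageHeight) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">\n",
                pageWidth, pageHeight));
        for (TextBlock b : blocks) {
            double top = pageHeight - b.y; // flip the y axis: CSS measures from the top edge
            html.append(String.format(Locale.ROOT,
                    "  <div style=\"position:absolute;left:%.1fpt;top:%.1fpt\">%s</div>\n",
                    b.x, top, escape(b.text)));
        }
        html.append("</div>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // An A4 page is 595 x 842 points.
        List<TextBlock> page = List.of(
                new TextBlock("Title", 72, 770),
                new TextBlock("Body text & more", 72, 740));
        System.out.print(toHtml(page, 595, 842));
    }
}
```

The result mirrors the page layout but carries no paragraphs, headings, or reading order, which is why such output is hard to edit or reflow afterwards.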

You can, however, read a PDF file and then extract the text and images from it, using libraries or other advanced techniques. .NET has a few libraries, for instance: http://forums.asp.net/post/2167442.aspx

If you only need to convert one file once, you can, for instance, open the PDF file in Illustrator and then export it as HTML. Or you can select the whole document (Ctrl+A), copy it, paste it into Word, and then save the result as HTML. It will be far from perfect, but it will be a start.

Answer 5

Answered by Kjk

It's not that difficult to convert PDF to HTML. There are many online options, which may, however, expose your data to third parties. Follow these steps, and the output is great.

Open the pdf2htmlEX page. (You can either follow the steps below or follow the directions on that page.)
The package is available for download for Windows from here.
From the many options available, I recommend downloading "pdf2htmlEX-win32-0.14.6-upx-with-poppler-data.zip (pdf2htmlEx.exe is packed with UPX)".
After downloading and unzipping, conversion is just one cmd command away.
```
C:\Users\kjk\Downloads\pdf2htmlEX-win32-0.14.6-upx-with-poppler-data>pdf2htmlEX.exe c:\abc.pdf
```
Final command:
```
pdf2htmlEX.exe c:\abc.pdf
```
(You can of course shorten the folder name; I kept it the same as you would see after unzipping the download. I am assuming you can change the directory in cmd to the desired folder, or else Google how.)

abc.pdf will be converted to HTML and saved as abc.html in the same folder as the exe.

Answer 6

Answered by Dmitry Belyaev

I'm not sure this will be helpful, but if you need a one-time conversion you can try this free online tool: https://www.readkong.com/

I have used this site several times. It produces HTML that is identical to the original PDF. No ugly or broken markup, no HTML mashup, and so on, even for very complex PDFs.

Answer 7

Answered by Samir Patel

Yes, it definitely is possible. If you're on Ubuntu Linux:

sudo apt-get install poppler-utils

then

pdftohtml -c -noframes myFile.pdf myFile.htm

If you want to see what all the flags mean, just type

pdftohtml

If you're not on Linux, there are plenty of other tools out there that you can use to make this happen.
