Notice: this page is based on a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/8370014/
How to convert PDF to HTML?
Asked by Luchian Grigore
Is there a proper library which I can use to convert PDF to HTML or some other format that can be converted to HTML easily?
I searched for similar questions, but had no luck.
I want to be able to extract text from PDFs, and possibly images. I'm not looking to embed the PDF inside the HTML.
Accepted answer by Siddharth Rout
As I mentioned in the comment above, it is definitely possible to convert PDF to HTML using Able2Extract 7, a tool that is available for download.
I have been using this tool for almost 2 years now and I am pretty happy with it. It lets you convert PDF to Word, Excel, PowerPoint, Publisher, HTML, OpenOffice, etc.
Important note: this tool is not freeware.
HTH
Answer by moof2k
If you're on Linux, try pdftohtml:
sudo apt-get install poppler-utils
pdftohtml -enc UTF-8 -noframes infile.pdf outfile.html
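If the conversion has to be driven from a program rather than typed into a shell, the same command can be launched as a child process. The following is a minimal sketch in Java, assuming pdftohtml (from poppler-utils) is installed and on the PATH. The class and method names (PdfToHtmlRunner, buildCommand, convert) are illustrative and not part of any library.

```java
import java.io.IOException;
import java.util.List;

class PdfToHtmlRunner {

    // Builds the same argument list as the shell command above.
    static List<String> buildCommand(String pdfPath, String htmlPath) {
        return List.of("pdftohtml", "-enc", "UTF-8", "-noframes", pdfPath, htmlPath);
    }

    // Runs pdftohtml as a child process and returns its exit code.
    // Assumption: the pdftohtml binary is resolvable through the PATH.
    static int convert(String pdfPath, String htmlPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath, htmlPath))
                .inheritIO() // forward the tool's messages to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToHtmlRunner <infile.pdf> <outfile.html>");
            return;
        }
        int exitCode = convert(args[0], args[1]);
        System.out.println("pdftohtml exited with code " + exitCode);
    }
}
```

Passing the arguments as a list, rather than as one concatenated string, avoids quoting problems when file names contain spaces.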
The open source ebook converter Calibre can also convert PDF files to HTML and is available on macOS, Windows and Linux.
Answer by Sergio Muriel
Download
- pdfbox-2.0.3.jar
- fontbox-2.0.3.jar
- preflight-2.0.3.jar
- xmpbox-2.0.3.jar
- pdfbox-tools-2.0.3.jar
- pdfbox-debugger-2.0.3.jar
from http://pdfbox.apache.org/
import java.io.InputStream;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.tools.PDFText2HTML;
// .....
try {
    InputStream is = // ..... Read PDF file
    // try-with-resources closes the document and the stream even if the conversion throws
    try (InputStream in = is; PDDocument pdd = PDDocument.load(in)) { // in-memory representation of the PDF document
        PDFText2HTML converter = new PDFText2HTML(); // the converter
        String html = converter.getText(pdd); // That's it! The HTML markup as a string
    }
} catch (IOException ioe) {
    // ......
}
Please note: Images do not get pushed to the HTML output.
Answer by thomasb
It is technically impossible to simply "convert" a PDF file to HTML. The PDF format is more like a "canvas", where you "place" your text blocks and images, whereas HTML needs either CSS or a lot of tables to "place" the blocks. Moreover, PDF files embed the images, whereas HTML simply calls other files.
There are many other examples of differences, but essentially, it's like asking to convert an image or a video with text in it.
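To make the "canvas" point concrete: a converter that wants to preserve the layout typically ends up emitting HTML in which every text block is absolutely positioned with CSS. The sketch below is purely illustrative. The TextBlock class and its coordinates are hypothetical stand-ins for what a PDF library might report, and the markup is not the output of any real tool.

```java
import java.util.List;
import java.util.Locale;

class CanvasToHtml {

    // Hypothetical text block: a string plus its position on the page, in points.
    // The top offset is measured from the top edge here, as CSS expects;
    // PDF's default coordinate system has its origin at the bottom-left instead.
    static final class TextBlock {
        final String text;
        final double left;
        final double top;

        TextBlock(String text, double left, double top) {
            this.text = text;
            this.left = left;
            this.top = top;
        }
    }

    // Escapes the characters that are significant in HTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits one absolutely positioned <div> per block inside a page-sized container.
    static String render(List<TextBlock> blocks, double pageWidth, double pageHeight) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.0fpt;height:%.0fpt\">\n",
                pageWidth, pageHeight));
        for (TextBlock block : blocks) {
            html.append(String.format(Locale.ROOT,
                    "<div style=\"position:absolute;left:%.0fpt;top:%.0fpt\">%s</div>\n",
                    block.left, block.top, escape(block.text)));
        }
        html.append("</div>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        List<TextBlock> page = List.of(
                new TextBlock("Invoice #1", 72, 72),
                new TextBlock("Total: < 100 & paid", 72, 120));
        System.out.print(render(page, 612, 792)); // 612 x 792 pt is US Letter
    }
}
```

Markup like this reproduces the look of the page, but it carries none of the structure (paragraphs, headings, reading order) that hand-written HTML would have, which is why the result of any automatic conversion is a starting point rather than a finished document.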
You can, however, read a PDF file and then extract the text and images from it, using libraries or other advanced techniques. .NET has a few libraries, for instance: http://forums.asp.net/post/2167442.aspx
If you only need to convert one file once, you can open the PDF file in Illustrator, for instance, and then export it as HTML. Or you can select the whole document (Ctrl+A), copy it, paste it into Word, and then save the result as HTML. It will be far from perfect, but it will be a start.
Answer by Kjk
It's not that difficult to convert PDF to HTML. There are many online options, which may, however, expose your data to third parties. Follow these steps, and the output is great.
Open the pdf2htmlEX page. (You can either follow the steps I describe next, or follow the directions on that page.)
The package is available for download for Windows.
From the many options available, I recommend downloading "pdf2htmlEX-win32-0.14.6-upx-with-poppler-data.zip (pdf2htmlEx.exe is packed with UPX)".
After downloading and unzipping, conversion is just one cmd command away.
C:\Users\kjk\Downloads\pdf2htmlEX-win32-0.14.6-upx-with-poppler-data>pdf2htmlEX.exe c:\abc.pdf
Final Command:
pdf2htmlEX.exe c:\abc.pdf
(You can of course shorten the name of the folder, however, I kept it the same as you would see after un-zipping the download. I am assuming you can change the directory in cmd to the desired folder or else Google how.)
abc.pdf will be converted to HTML and saved as abc.html in the folder you ran the command from (in the example above, the same folder as the exe).
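For scripted use, it can help to know the output file name before the tool runs. The sketch below is a rough Java helper built only on what this answer describes: the executable takes the PDF path as its single argument, and the output is named after the input with an .html extension. The names Pdf2HtmlExRunner, expectedOutputName and buildCommand are hypothetical, and the handling of inputs that do not end in .pdf is an assumption.

```java
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;

class Pdf2HtmlExRunner {

    // Predicts the output file name: abc.pdf becomes abc.html.
    // Assumption: a name without a .pdf suffix simply gets .html appended.
    static String expectedOutputName(String pdfPath) {
        String name = Paths.get(pdfPath).getFileName().toString();
        boolean hasPdfSuffix = name.toLowerCase(Locale.ROOT).endsWith(".pdf");
        String base = hasPdfSuffix ? name.substring(0, name.length() - 4) : name;
        return base + ".html";
    }

    // Mirrors the one-line command from the answer: <exe> <pdf>.
    static List<String> buildCommand(String exePath, String pdfPath) {
        return List.of(exePath, pdfPath);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: Pdf2HtmlExRunner <path-to-pdf2htmlEX.exe> <file.pdf>");
            return;
        }
        int exitCode = new ProcessBuilder(buildCommand(args[0], args[1]))
                .inheritIO() // show the converter's progress output
                .start()
                .waitFor();
        System.out.println("exit code " + exitCode
                + ", expected output: " + expectedOutputName(args[1]));
    }
}
```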
Answer by Dmitry Belyaev
Not sure whether it will be helpful, but if you need a one-time conversion you can try this free online tool: https://www.readkong.com/
I have used this site several times. It produces HTML that looks identical to the original PDF. There is no ugly or broken markup, no HTML mashup and so on, even for very complex PDFs.
Answer by Samir Patel
Yeah, it definitely is possible. If you're on Ubuntu Linux:
sudo apt-get install poppler-utils
then
pdftohtml -c -noframes myFile.pdf myFile.htm
If you want to see what all the flags mean, just type
pdftohtml
If you're not on Linux, there are plenty of tools out there that you can use to make this happen.