Converting HTML to PDF using iText
Original question: http://stackoverflow.com/questions/47895935/
Source: Stack Overflow. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Bruno Lowagie
I am posting this question because many developers ask more or less the same question in different forms. I will answer this question myself (I am the Founder/CTO of iText Group), so that it can be a "Wiki-answer." If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.
The source file:
I am trying to convert the following HTML file to PDF:
<html>
<head>
<title>Colossal (movie)</title>
<style>
.poster { width: 120px;float: right; }
.director { font-style: italic; }
.description { font-family: serif; }
.imdb { font-size: 0.8em; }
a { color: red; }
</style>
</head>
<body>
<img src="img/colossal.jpg" class="poster" />
<h1>Colossal (2016)</h1>
<div class="director">Directed by Nacho Vigalondo</div>
<div class="description">Gloria is an out-of-work party girl
forced to leave her life in New York City, and move back home.
When reports surface that a giant creature is destroying Seoul,
she gradually comes to the realization that she is somehow connected
to this phenomenon.
</div>
<div class="imdb">Read more about this movie on
<a href="www.imdb.com/title/tt4680182">IMDB</a>
</div>
</body>
</html>
In a browser, this HTML looks like this:
The problems I encountered:
HTMLWorker doesn't take CSS into account at all
When I use HTMLWorker, I need to create an ImageProvider to avoid an error informing me that the image can't be found. I also need to create a StyleSheet instance to change some of the styles:
public static class MyImageFactory implements ImageProvider {
    // Maps the src attribute of an <img> tag to a file under resources/html/img/
    public Image getImage(String src, Map<String, String> h,
            ChainedProperties cprops, DocListener doc) {
        try {
            return Image.getInstance(
                String.format("resources/html/img/%s",
                    src.substring(src.lastIndexOf("/") + 1)));
        } catch (DocumentException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

public static void main(String[] args) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
    document.open();
    // Styles have to be declared in code; the CSS in the HTML file is not read
    StyleSheet styles = new StyleSheet();
    styles.loadStyle("imdb", "size", "-3");
    HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
    // Register the image provider so that <img> tags can be resolved
    HashMap<String,Object> providers = new HashMap<String, Object>();
    providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
    htmlWorker.setProviders(providers);
    htmlWorker.parse(new FileReader("resources/html/sample.html"));
    document.close();
}
The result looks like this:
For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all; I have to define all the styles in my code, using the StyleSheet object.
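The path handling inside MyImageFactory can be tried in isolation. The following is a minimal sketch that uses only the JDK; the class name ImagePathSketch and the method resolveImagePath are hypothetical names introduced for illustration, and the root folder mirrors the one hard-coded in the code above.

```java
public class ImagePathSketch {
    // Root folder where the images are expected (same value as in MyImageFactory).
    static final String IMG_ROOT = "resources/html/img/";

    // Keeps only the part of src after the last '/', then prepends the root folder.
    static String resolveImagePath(String src) {
        String fileName = src.substring(src.lastIndexOf("/") + 1);
        return IMG_ROOT + fileName;
    }

    public static void main(String[] args) {
        // Both calls end up pointing at the same file on disk.
        System.out.println(resolveImagePath("img/colossal.jpg"));
        System.out.println(resolveImagePath("colossal.jpg"));
    }
}
```

Because any folder structure in src is discarded, two images with the same file name in different folders would map to the same path with this approach.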
When I look at my code, I see that plenty of objects and methods I'm using are deprecated:
So I decided to upgrade to using XML Worker.
Images aren't found when using XML Worker
I tried the following code:
public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
        new FileInputStream(HTML));
    document.close();
}
This resulted in the following PDF:
Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.
With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.
Not all CSS styles are supported in XML Worker
I adapted my code like this:
public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    // CSS resolver with the default style sheet
    CSSResolver cssResolver =
        XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
    // HTML context with an image provider that points at the image root folder
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
    htmlContext.setImageProvider(new AbstractImageProvider() {
        public String getImageRootPath() {
            return IMG_PATH;
        }
    });
    // Pipelines are chained: CSS -> HTML -> PDF writer
    PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new FileInputStream(HTML));
    document.close();
}
My code is much longer, but now the image is rendered:
The image is larger than when I rendered it using HTMLWorker, which tells me that the CSS attribute width for the poster class is taken into account, but the float attribute is ignored. How do I fix this?
The remaining question:
So the question boils down to this: I have a specific HTML file that I am trying to convert to PDF. I have gone through a lot of work, fixing one problem after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?
Additional question:
When my HTML contains form elements (such as <input>), those form elements are ignored.
Answered by Bruno Lowagie
Why your code doesn't work
As explained in the introduction of the HTML to PDF tutorial, HTMLWorker was deprecated many years ago. It wasn't intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content, which is why the content of the <title> tag shows up in the PDF. It was meant to parse small HTML snippets, and you could define styles using the StyleSheet class; real CSS wasn't supported.
Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all of the HTML tags. For instance: forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.
Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons, the most important one in this context being pdfHTML.
How to solve the problem
Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0) the code to convert the HTML from the question to PDF is reduced to this snippet:
public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";

public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}
The result looks like this:
As you can see, this is pretty much the result you'd expect. Since iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is being respected: the image is now floating on the right.
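The image shows up without any image provider because a relative src such as img/colossal.jpg can be resolved against the location of the HTML file. The sketch below only illustrates generic URI resolution with java.net.URI; it is not pdfHTML's own code, and the class name BaseUriSketch and the base folder are assumptions made for illustration.

```java
import java.net.URI;

public class BaseUriSketch {
    // Resolves a reference found in the HTML against a base URI.
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Hypothetical absolute location of the folder that holds sample.html.
        String base = "file:/project/src/main/resources/html/";
        // A relative image path is appended to the base folder.
        System.out.println(resolve(base, "img/colossal.jpg"));
        // A link without a scheme is also treated as a relative path.
        System.out.println(resolve(base, "www.imdb.com/title/tt4680182"));
        // A link with a scheme is absolute and stays unchanged.
        System.out.println(resolve(base, "https://www.imdb.com/title/tt4680182"));
    }
}
```

The second call hints at a detail in the sample HTML: the href to IMDB has no scheme, so under plain URI rules it would be read as a path relative to the document rather than as a web address.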
Some additional thoughts.
Developers are often reluctant to upgrade when I advise them to move to iText 7 / pdfHTML 2. Allow me to respond to the three arguments I hear most often:
I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.
iText 7 is released using the AGPL, just like iText 5 and XML Worker. The AGPL allows free use, in the sense of free of charge, in the context of open source projects. If you are distributing a closed source / proprietary product (e.g. you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; this is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all.
Regarding pdfHTML: the first versions were indeed only available as closed source software. We have had heavy discussions within iText Group. On the one hand, there were the people who wanted to avoid the massive abuse by companies who don't listen to their developers when those developers tell the powers that be that open source isn't the same as free. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were the people who argued that we shouldn't punish developers for the wrong behavior of their bosses. Eventually, the people in favor of open sourcing pdfHTML, that is, the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.
I need to maintain a legacy system, and I have to use an old iText version.
Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code needed when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.
I've only just started and I didn't know about iText 7; I only found out after I finished my project.
That's why I'm posting this question and answer. Think of yourself as an eXtreme Programmer. Throw away all of your code, and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof because iText 5 is being phased out. We still offer support to paying customers, but eventually, we'll stop supporting iText 5 altogether.
Answered by Abhishek Sengupta
Use iText 7 and this code:
public void generatePDF(String htmlFile) {
    try {
        // HTML string: despite its name, htmlFile is passed on as HTML markup, not as a path
        String htmlString = htmlFile;
        // Setting destination (dirPath is not defined in this snippet; it is assumed to be declared elsewhere)
        FileOutputStream fileOutputStream = new FileOutputStream(new File(dirPath + "/USER-16-PF-Report.pdf"));
        PdfWriter pdfWriter = new PdfWriter(fileOutputStream);
        ConverterProperties converterProperties = new ConverterProperties();
        PdfDocument pdfDocument = new PdfDocument(pdfWriter);
        // For setting the page size
        pdfDocument.setDefaultPageSize(new PageSize(PageSize.A3));
        Document document = HtmlConverter.convertToDocument(htmlString, pdfDocument, converterProperties);
        document.close();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
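The snippet above hands a String to convertToDocument, and that overload appears to expect HTML markup rather than a file path. Assuming the caller starts from an HTML file on disk, the content would have to be read into a String first. A minimal sketch using only the JDK follows; the class name HtmlFileReaderSketch and the methods readHtml and demo are hypothetical helpers, not part of iText.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlFileReaderSketch {
    // Reads the whole file as UTF-8 text; I/O failures are rethrown unchecked.
    static String readHtml(Path htmlPath) {
        try {
            return new String(Files.readAllBytes(htmlPath), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a small HTML snippet to a temporary file and reads it back.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("sample", ".html");
            Files.write(tmp, "<h1>Colossal (2016)</h1>".getBytes(StandardCharsets.UTF_8));
            String html = readHtml(tmp);
            Files.delete(tmp);
            return html;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The String returned by readHtml is what would be passed as the first argument in place of a path. The sketch assumes the file is UTF-8 encoded; a different encoding would need a different charset.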