Original source: http://stackoverflow.com/questions/1944576/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Convert a pdf file to text in C#
Asked by aharon
I need to convert a .pdf file to a .txt file (or .doc, but I prefer .txt).
How can I do this in C#?
Accepted answer by serge_gubenko
Ghostscript could do what you need. Below is a command for extracting text from a PDF file into a txt file (you can run it from a command line to test whether it works for you):
gswin32c.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "test.pdf" -c quit >"test.txt"
Check here: codeproject: Convert PDF to Image Using Ghostscript API for details on how to use Ghostscript with C#.
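The command above can also be launched from C# through System.Diagnostics.Process. The following is a minimal sketch, not the linked article's approach: the class name GhostscriptTextExtractor and its two methods are hypothetical, and it assumes gswin32c.exe is on the PATH (or that you pass its full path) and that Ghostscript can locate ps2ascii.ps. Since there is no shell to handle the > redirection, the program captures standard output itself and writes it to the text file.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper: wraps the Ghostscript command line shown above.
public static class GhostscriptTextExtractor
{
    // Builds the same arguments as the command above, minus the shell redirection.
    public static string BuildArguments(string pdfPath)
    {
        return "-q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE " +
               "-c save -f ps2ascii.ps \"" + pdfPath + "\" -c quit";
    }

    // Runs Ghostscript and saves whatever it prints to standard output into txtPath.
    public static void ExtractText(string ghostscriptExe, string pdfPath, string txtPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = ghostscriptExe,          // assumption: e.g. "gswin32c.exe"
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,            // required to redirect output
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read all output before waiting, so a full pipe cannot block Ghostscript.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "Ghostscript exited with code " + process.ExitCode);
            }

            File.WriteAllText(txtPath, text);
        }
    }
}
```

A call would look like GhostscriptTextExtractor.ExtractText("gswin32c.exe", "test.pdf", "test.txt"); the file names are the ones used in the command above.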
Answer by Zaid Amir
Converting PDF to text is not really straightforward, and you won't see anyone posting code here that converts PDF to text directly. So your best bet is to use a library that does the job for you. A good one is PDFBox; you can google it. You'll probably find it written in Java, but fortunately you can use IKVM to convert it to .NET.
Answer by Don
I've had the need myself and I used this article to get me started: http://www.codeproject.com/KB/string/pdf2text.aspx
Answer by Justin
As an alternative to Don's solution, I found the following:
Answer by Bobrovsky
Docotic.Pdf library can extract text from PDF files (formatted or not).
Here is sample code that shows how to extract formatted text from a PDF file and save it to another file.
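using System.IO;
using BitMiracle.Docotic.Pdf;

// Extracts the text of pdfFile, preserving its layout, and writes it to textFile.
public static void ExtractFormattedText(string pdfFile, string textFile)
{
    using (PdfDocument doc = new PdfDocument(pdfFile))
    {
        string text = doc.GetTextWithFormatting();
        File.WriteAllText(textFile, text);
    }
}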
Also, there is a sample on our site that shows other options for extraction of text from PDF files.
Disclaimer: I work for Bit Miracle, vendor of the library.
Answer by shuvo sarker
using System;
using System.IO;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// These methods belong inside a Windows Forms class that has a RichTextBox named richTextBox1.
public void PDF_TEXT()
{
    richTextBox1.Text = string.Empty;
    ReadPdfFile(@"C:\Myfile.pdf"); // read the PDF file from this location
}

public void ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    try
    {
        // Open the reader only after confirming that the file exists.
        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                text.Append(currentText);
            }
            pdfReader.Close();
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
    richTextBox1.Text = text.ToString();
}

private void Save_TextFile_Click(object sender, EventArgs e)
{
    DialogResult messageResult = MessageBox.Show("Save this file into Text?", "Text File", MessageBoxButtons.OKCancel);
    if (messageResult == DialogResult.Cancel)
    {
        return; // the user declined to save
    }

    SaveFileDialog sfd = new SaveFileDialog();
    sfd.Title = "Save As Textfile";
    sfd.InitialDirectory = @"C:\";
    sfd.Filter = "TextDocuments|*.txt";
    if (sfd.ShowDialog() == DialogResult.OK)
    {
        if (richTextBox1.Text != "")
        {
            // Write the extracted text as plain text, then clear the box.
            richTextBox1.SaveFile(sfd.FileName, RichTextBoxStreamType.PlainText);
            richTextBox1.Text = "";
            MessageBox.Show("Text Saved Successfully", "Text File");
        }
        else
        {
            MessageBox.Show("Please Upload Your Pdf", "Text File",
                MessageBoxButtons.OKCancel, MessageBoxIcon.Asterisk);
        }
    }
}