Original URL: http://stackoverflow.com/questions/1179157/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
How do I handle line breaks in a CSV file using C#?
Asked by user144658
I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:
"John","23","555-5555"
"Peter","24","555-5
555"
"Mary","21","555-5555"
When I read the CSV file, if a record does not start with a double quote (") then a line break is there by mistake and I have to remove it. I have some CSV reader classes from the internet, but I am concerned that they will fail on the line breaks.
How should I handle these line breaks?
Thanks everybody very much for your help.
Here is what I've done so far. My records have a fixed format and all start with JTW:
JTW;...;....;...;
JTW;...;...;....
JTW;....;...;..
..;...;... (wrong record, line break inserted)
JTW;...;...
So I check for the ; at position [3] of each line. If it is there, I write the line as a new record; if not, I append the line to the previous one (removing the line break).
I'm having problems now because I'm saving the file as a txt.
By the way, I am converting the Excel spreadsheet to CSV by saving it as CSV in Excel, but I'm not sure whether the client is doing the same.
So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to CSV, and I would really like to do it in the program. Does anybody know how?
Here is my code:
    using System;
    using System.IO;

    namespace EditorCSV
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Verbatim strings keep the backslashes in the paths literal.
                ReadFromFile(@"c:\source.csv", @"c:\target.csv");
            }

            static void ReadFromFile(string sourcePath, string targetPath)
            {
                int records = 0;
                bool isFirstLine = true;

                using (StreamReader reader = File.OpenText(sourcePath))
                using (StreamWriter writer = File.CreateText(targetPath))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // A real record has ';' at index 3 (for example "JTW;").
                        // Checking the length first avoids the IndexOutOfRangeException.
                        bool startsRecord = line.Length > 3 && line[3] == ';';

                        if (startsRecord)
                        {
                            // Terminate the previous line before starting a new record.
                            if (!isFirstLine) writer.Write("\r\n");
                            records++;
                        }

                        // Any other line (including one too short to test) is a fragment
                        // of the previous record, so it is appended without a line break.
                        writer.Write(line);
                        isFirstLine = false;
                    }
                }

                Console.WriteLine("Records processed: " + records + ".");
                Console.WriteLine("File created successfully.");
                Console.ReadKey();
            }
        }
    }
Answered by Freddy
Maybe you could count the double quotes (") in each line returned by ReadLine(). If the count is odd, that raises the flag: the line ends in the middle of a quoted field. You could either ignore those lines, or read the next line as well and remove the "\n" between them when you merge the two.
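A minimal sketch of the quote-counting idea, assuming every field is quoted as in the sample data and that a stray break should simply be removed. QuoteCountingReader and ReadRecords are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class QuoteCountingReader
{
    // Hypothetical helper: yields one logical record per item, joining physical
    // lines until the accumulated text contains an even number of double quotes.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        string pending = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Join without a separator, which drops the stray line break.
            pending = pending == null ? line : pending + line;

            if (pending.Count(ch => ch == '"') % 2 == 0)
            {
                yield return pending;
                pending = null;
            }
        }

        // Unbalanced text left at the end of the input is returned as-is.
        if (pending != null) yield return pending;
    }
}
```

Escaped quotes inside a field ("") add two to the count, so they do not disturb the parity check.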
Answered by Doug
Rather than checking whether the current line is missing the (") as its first character, check whether its last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge the two together.
I am assuming your example data was accurate: fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
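The trailing-quote check could be sketched as follows, under the same assumption that every field, and therefore every complete record, ends with a double quote. TrailingQuoteReader is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class TrailingQuoteReader
{
    // Hypothetical helper: keeps appending physical lines until one ends with a
    // double quote, then emits the accumulated text as a single record.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        var current = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            current.Append(line);

            bool endsWithQuote = line.Length > 0 && line[line.Length - 1] == '"';
            if (endsWithQuote)
            {
                yield return current.ToString();
                current.Clear();
            }
        }

        // Text that never reached a closing quote is returned as-is.
        if (current.Length > 0) yield return current.ToString();
    }
}
```

Note that a break falling exactly between two fields (right after a closing quote) would be mistaken for the end of a record, so this check only covers breaks inside a field.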
Answered by Doug
There is an example parser in C# that seems to handle your case correctly. You can read your data in with it and then purge the line breaks from the fields after reading. Part 2 is the parser, and there is a Part 1 that covers the writer portion.
Answered by FlappySocks
Read the line.
Split it into columns (fields).
If you have the number of columns expected for each line, process it.
If not, read the next line and capture the remaining columns until you have what you need.
Repeat.
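The steps above can be sketched roughly like this, assuming the delimiter never appears inside a field (so a plain Split is enough to count columns) and that the expected column count is known in advance. ColumnCountReader, ReadRows and their parameters are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;

static class ColumnCountReader
{
    // Hypothetical helper: merges physical lines until the merged text splits
    // into at least expectedColumns fields, then emits those fields as one row.
    public static IEnumerable<string[]> ReadRows(TextReader reader, char delimiter, int expectedColumns)
    {
        string pending = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Join without a separator, which drops the stray line break.
            pending = pending == null ? line : pending + line;

            string[] fields = pending.Split(delimiter);
            if (fields.Length >= expectedColumns)
            {
                yield return fields;
                pending = null;
            }
        }

        // An incomplete row at the end of the input is still returned.
        if (pending != null) yield return pending.Split(delimiter);
    }
}
```

A break inside the last column cannot be detected this way, because the first half of the line already has enough columns; the leftover fragment would come out as a short row of its own.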
Answered by Michael La Voie
CSV has predefined ways of handling that. This site provides an easy-to-read explanation of the standard way to handle all the caveats of CSV.
Nevertheless, there is really no reason not to use a solid, open source library for reading and writing CSV files, which avoids non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.
Alternatively, this SO question on CSV libraries will give you a list of the most popular choices.
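For reference, the usual CSV convention (as described in RFC 4180) is that a field containing a comma, a double quote or a line break is wrapped in double quotes, and any double quote inside it is doubled. A minimal writer-side sketch of that rule follows; CsvWriterSketch, EscapeField and ToLine are hypothetical names, not part of any library mentioned above.

```csharp
using System.Collections.Generic;
using System.Linq;

static class CsvWriterSketch
{
    // Quotes a field only when it contains a character that would otherwise
    // be read as structure: comma, double quote, CR or LF.
    public static string EscapeField(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return field;

        // Embedded double quotes are escaped by doubling them.
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Builds one CSV line from a sequence of raw field values.
    public static string ToLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(EscapeField));
    }
}
```

A file written this way keeps line breaks inside quotes, which is why a reader that follows the same rule can tell them apart from record separators.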
Answered by John Fisher
A fairly simple regular expression could be applied to each line. When it matches, you process each field from the match. When it doesn't match, you skip that line.
The regular expression could look something like this.
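    using System.Text.RegularExpressions;

    // line holds one line read from the CSV file.
    // Each field is either wrapped in matching quotes (' or ") or contains no comma or quote.
    Match match = Regex.Match(line, @"^(?:(?:^|,)(?:(?<q>['""])(?<field>.*?)\k'q'|(?<field>[^,'""]*)))+$");
    if (match.Success)
    {
        // Every field in the line is recorded as a separate capture of the "field" group.
        foreach (Capture capture in match.Groups["field"].Captures)
        {
            string fieldValue = capture.Value;
            // Use the value.
        }
    }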
Answered by John
Because of this very problem, I usually read the text character by character rather than line by line.
As you read each character, you should be able to figure out where each cell starts and stops, and also tell a line break between rows from a line break inside a cell: if I remember correctly, for Excel-generated files anyway, rows are separated by \r\n, while newlines in cells are only \r.
Answered by Judah Gabriel Himango
Heed the advice from the experts and don't roll your own CSV parser.
Your first thought is, "How do I handle new line breaks?"
Your next thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.
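As one illustration of leaning on an existing parser instead of writing one, the sketch below uses TextFieldParser from the Microsoft.VisualBasic assembly. This is a swapped-in choice, not the FileHelpers library recommended above, picked only because it needs no extra package. LibraryReaderSketch and ReadAll are hypothetical names. As far as I know this parser accepts line breaks inside quoted fields, but treat that as an assumption to verify against your own files.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class LibraryReaderSketch
{
    // Hypothetical helper: reads every record from the reader and returns
    // the fields of each record as a string array.
    public static List<string[]> ReadAll(TextReader reader)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                // ReadFields returns the fields with the surrounding quotes removed.
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

Any line break that survives inside a returned field can then be stripped with a plain string Replace before the value is used.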
Answered by Zoman
I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):
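    // Requires: using System.Collections.Generic; using System.IO; using System.Text;
    private const char separator = ','; // assumed; the simplified original does not show its definition

    private void Parse(TextReader reader)
    {
        var row = new List<string>();
        var isStringBlock = false; // true while we are inside a quoted field
        var sb = new StringBuilder();
        long charIndex = 0;
        int currentLineCount = 0;

        while (reader.Peek() != -1)
        {
            charIndex++;
            char c = (char)reader.Read();

            // Every double quote toggles the "inside quotes" state.
            if (c == '"')
                isStringBlock = !isStringBlock;

            if (c == separator && !isStringBlock) //end of word
            {
                row.Add(sb.ToString().Trim()); //add word
                sb.Length = 0;
            }
            else if (c == '\n' && !isStringBlock) //end of line
            {
                row.Add(sb.ToString().Trim()); //add last word in line
                sb.Length = 0;
                //DO SOMETHING WITH row HERE!
                currentLineCount++;
                row = new List<string>();
            }
            else
            {
                // Quotes and carriage returns are dropped; a line break inside
                // a quoted field is replaced with a space.
                if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
            }
        }

        row.Add(sb.ToString().Trim()); //add last word
        //DO SOMETHING WITH LAST row HERE!
    }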
Answered by Yuriy Faktorovich
The LINQy solution:
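    // Requires: using System; using System.IO; using System.Linq;
    // Assumes three quoted fields per record, with no commas or empty fields inside them.
    string csvText = File.ReadAllText(@"C:\Test.txt");
    var query = csvText
        .Replace(Environment.NewLine, string.Empty) // drop every line break, stray or not
        .Replace("\"\"", "\",\"")                   // a closing quote followed by an opening quote marks a record boundary
        .Split(',')
        .Select((i, n) => new { i, n })             // pair each field with its index
        .GroupBy(a => a.n / 3);                     // three fields per record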