C#, regular expressions: how to parse comma-separated values, where some values might be quoted strings themselves containing commas

Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/1189416/

Tags: c#, regex, csv

Asked by JaysonFix

In C#, using the Regex class, how does one parse comma-separated values, where some values might be quoted strings themselves containing commas?

using System;
using System.Text.RegularExpressions;

class Example
{
    public static void Main()
    {
        string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear";
        Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
        Regex regex = new Regex("(?<=,(\"|\')).*?(?=(\"|\'),)|(^.*?(?=,))|((?<=,).*?(?=,))|((?<=,).*?$)");
        Match match = regex.Match(myString);
        int j = 0;
        while (match.Success)
        {
            Console.WriteLine(j++ + " \t" + match);
            match = match.NextMatch();
        }
    }
}

Output (in part) appears as follows:

0       cat
1       dog
2       "0 = OFF
3        1 = ON"
4       lion
5       tiger
6       'R = red
7        G = green
8        B = blue'
9       bear

However, the desired output is:

0       cat
1       dog
2       0 = OFF, 1 = ON
3       lion
4       tiger
5       R = red, G = green, B = blue
6       bear

Accepted answer by CMS

Try this regex:

"[^"\r\n]*"|'[^'\r\n]*'|[^,\r\n]*


    Regex regexObj = new Regex(@"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");
    Match matchResults = regexObj.Match(input);
    while (matchResults.Success) 
    {
        Console.WriteLine(matchResults.Value);
        matchResults = matchResults.NextMatch();
    }

Outputs:

  • cat
  • dog
  • "0 = OFF, 1 = ON"
  • lion
  • tiger
  • 'R = red, G = green, B = blue'
  • bear

Note: This regex solution will work for your case; however, I recommend using a specialized library like FileHelpers.
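The matches listed above still carry their surrounding quotes, whereas the question asks for the bare values. The following is a minimal sketch of one way to post-process the matches, not part of the original answer; `QuoteTrimSketch` and `Split` are hypothetical names. As far as I can tell, the last alternative `[^,\r\n]*` can also match the empty string, so zero-length matches may appear between fields; the sketch skips them, which has the side effect of dropping genuinely empty fields too.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuoteTrimSketch
{
    // The pattern from the answer above, unchanged.
    private static readonly Regex FieldRegex =
        new Regex(@"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");

    // Hypothetical helper: returns the fields without their enclosing quotes.
    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        foreach (Match m in FieldRegex.Matches(input))
        {
            string value = m.Value;
            if (value.Length == 0)
            {
                // Zero-length match; skipping it also drops truly empty fields.
                continue;
            }

            char first = value[0];
            bool quoted = value.Length >= 2
                && (first == '"' || first == '\'')
                && value[value.Length - 1] == first;

            // Strip one pair of matching quotes, if present.
            fields.Add(quoted ? value.Substring(1, value.Length - 2) : value);
        }
        return fields;
    }
}
```

Calling `QuoteTrimSketch.Split(myString)` on the string from the question should yield the seven values listed as the desired output. If empty fields matter for your data, this filtering step is not suitable as written.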

Answer by kenwarner

It's not a regex, but I've used Microsoft.VisualBasic.FileIO.TextFieldParser to accomplish this for CSV files. Yes, it might feel a little strange adding a reference to Microsoft.VisualBasic in a C# app, maybe even a little dirty, but hey, it works.
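A minimal sketch of how TextFieldParser might be used for a single line, assuming the project references the Microsoft.VisualBasic assembly. `TextFieldParserSketch` and `SplitLine` are hypothetical names chosen for illustration. As far as I know, this parser only treats double quotes as field quoting, so the single-quoted value in the question's sample would still be split at its commas.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class TextFieldParserSketch
{
    // Hypothetical helper: splits one comma-delimited line into fields.
    public static string[] SplitLine(string line)
    {
        // The string constructor expects a file path, so wrap the text in a reader.
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // double-quoted fields may contain commas

            // ReadFields returns null when there is no data left.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

For a whole file, the same parser can be read in a loop with `while (!parser.EndOfData)` and one `ReadFields()` call per record.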

Answer by Judah Gabriel Himango

Why not heed the advice from the experts: don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free and open source FileHelpers library.

Answer by codekaizen

Ah, RegEx. Now you have two problems. ;)

I'd use a tokenizer/parser, since it is quite straightforward, and more importantly, much easier to read for later maintenance.

This works, for example:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green,     B = blue',bear"; 
        Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
        CsvParser parser = new CsvParser(myString);

        Int32 lineNumber = 0;
        foreach (string s in parser)
        {
            // Increment the counter so each value gets its own index
            Console.WriteLine(lineNumber++ + ": " + s);
        }

        Console.ReadKey();
    }
}

internal enum TokenType
{
    Comma,
    Quote,
    Value
}

internal class Token
{
    public Token(TokenType type, string value)
    {
        Value = value;
        Type = type;
    }

    public String Value { get; private set; }
    public TokenType Type { get; private set; }
}

internal class StreamTokenizer : IEnumerable<Token>
{
    private TextReader _reader;

    public StreamTokenizer(TextReader reader)
    {
        _reader = reader;    
    }

    public IEnumerator<Token> GetEnumerator()
    {
        String line;
        StringBuilder value = new StringBuilder();

        while ((line = _reader.ReadLine()) != null)
        {
            foreach (Char c in line)
            {
                switch (c)
                {
                    case '\'':
                    case '"':
                        if (value.Length > 0)
                        {
                            yield return new Token(TokenType.Value, value.ToString());
                            value.Length = 0;
                        }
                        yield return new Token(TokenType.Quote, c.ToString());
                        break;
                    case ',':
                        if (value.Length > 0)
                        {
                            yield return new Token(TokenType.Value, value.ToString());
                            value.Length = 0;
                        }
                        yield return new Token(TokenType.Comma, c.ToString());
                        break;
                    default:
                        value.Append(c);
                        break;
                }
            }

            // Thanks, dpan
            if (value.Length > 0)
            {
                yield return new Token(TokenType.Value, value.ToString());
                value.Length = 0; // reset so the next line does not repeat this value
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

internal class CsvParser : IEnumerable<String>
{
    private StreamTokenizer _tokenizer;

    public CsvParser(Stream data)
    {
        _tokenizer = new StreamTokenizer(new StreamReader(data));
    }

    public CsvParser(String data)
    {
        _tokenizer = new StreamTokenizer(new StringReader(data));
    }

    public IEnumerator<string> GetEnumerator()
    {
        Boolean inQuote = false;
        StringBuilder result = new StringBuilder();

        foreach (Token token in _tokenizer)
        {
            switch (token.Type)
            {
                case TokenType.Comma:
                    if (inQuote)
                    {
                        result.Append(token.Value);
                    }
                    else
                    {
                        yield return result.ToString();
                        result.Length = 0;
                    }
                    break;
                case TokenType.Quote:
                    // Toggle quote state
                    inQuote = !inQuote;
                    break;
                case TokenType.Value:
                    result.Append(token.Value);
                    break;
                default:
                    throw new InvalidOperationException("Unknown token type: " + token.Type);
            }
        }

        if (result.Length > 0)
        {
            yield return result.ToString();
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

Answer by Partha Choudhury

Function:

    private List<string> ParseDelimitedString (string arguments, char delim = ',')
    {
        bool inQuotes = false;
        bool inNonQuotes = false; //used to trim leading WhiteSpace

        List<string> strings = new List<string>();

        StringBuilder sb = new StringBuilder();
        foreach (char c in arguments)
        {
            if (c == '\'' || c == '"')
            {
                // Toggle quote state; the quote character itself is not kept
                inQuotes = !inQuotes;
            }
            else if (c == delim)
            {
                if (!inQuotes)
                {
                    strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
                    sb.Remove(0, sb.Length);
                    inNonQuotes = false;
                }
                else
                {
                    sb.Append(c);
                }
            }
            else if (inQuotes || inNonQuotes || !char.IsWhiteSpace(c))
            {
                // Skip only the leading whitespace of an unquoted field; keep everything else
                if (!inQuotes) inNonQuotes = true;
                sb.Append(c);
            }
        }
        strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());


        return strings;
    }

Usage

    string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear,         text";
    List<string> strings = ParseDelimitedString(myString);

    foreach( string s in strings )
            Console.WriteLine( s );

Output:

cat
dog
0 = OFF, 1 = ON
lion
tiger
R = red, G = green, B = blue
bear
text

Answer by ShuggyCoUk

CSV is not regular. Unless your regex language has sufficient power to handle the stateful nature of CSV parsing (unlikely; the MS one does not), any pure regex solution is a list of bugs waiting to happen as you hit a new input source that isn't quite handled by the last regex.

CSV reading is not that complex to write as a state machine since the grammar is simple but even so you must consider: quoted quotes, commas within quotes, new lines within quotes, empty fields.

As such, you should probably just use someone else's CSV parser. I recommend CSVReader for .NET.
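To illustrate the cases listed in this answer (quoted quotes, commas within quotes, new lines within quotes, empty fields), here is a rough state-machine sketch. It is not taken from any library; `CsvStateMachineSketch` and `Parse` are hypothetical names, and the sketch assumes double quotes only, with a doubled quote standing for a literal quote. It is meant to show the shape of the problem, not to replace a tested parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CsvStateMachineSketch
{
    // Hypothetical helper: parses CSV text into records, each a list of fields.
    public static List<List<string>> Parse(string text)
    {
        var records = new List<List<string>>();
        var record = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    field.Append(c); // commas and newlines are literal inside quotes
                }
                else if (i + 1 < text.Length && text[i + 1] == '"')
                {
                    field.Append('"'); // doubled quote stands for one literal quote
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ',')
            {
                record.Add(field.ToString()); // may be an empty field
                field.Length = 0;
            }
            else if (c == '\n')
            {
                record.Add(field.ToString());
                field.Length = 0;
                records.Add(record);
                record = new List<string>();
            }
            else if (c != '\r')
            {
                field.Append(c);
            }
        }

        // Flush the last record when the text does not end with a newline.
        if (field.Length > 0 || record.Count > 0)
        {
            record.Add(field.ToString());
            records.Add(record);
        }
        return records;
    }
}
```

Even this small sketch leaves questions open, such as single-quoted fields, stray quotes in unquoted fields, and other delimiters, which is in line with the advice above to prefer an existing parser.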

Answer by Joshua

Just adding the solution I worked on this morning.

var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");

foreach (Match m in regex.Matches("<-- input line -->"))
{
    var s = m.Value; 
}

As you can see, you need to call regex.Matches() once per line. It returns a MatchCollection with as many items as the line has columns. The Value property of each match is the parsed value.

This is still a work in progress, but it happily parses CSV strings like:

2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
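With this pattern, a quoted column comes back with its enclosing quotes and with inner quotes still doubled. The following is a small sketch of one possible clean-up step, assuming doubled quotes are the only escape in use; `QuotedFieldSketch` and `Split` are hypothetical names, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuotedFieldSketch
{
    // The pattern from the answer above, unchanged.
    private static readonly Regex FieldRegex =
        new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");

    // Hypothetical helper: splits one line and unescapes quoted columns.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        foreach (Match m in FieldRegex.Matches(line))
        {
            string value = m.Value;
            if (value.Length >= 2 && value[0] == '"' && value[value.Length - 1] == '"')
            {
                // Drop the enclosing quotes, then collapse "" into ".
                value = value.Substring(1, value.Length - 2).Replace("\"\"", "\"");
            }
            fields.Add(value);
        }
        return fields;
    }
}
```

For the sample line above, the third column would then read `Hello, my name is "Joshua"`, and the two empty columns before `D` are kept as empty strings.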

Answer by David Wayne Rasmussen

I found a few bugs in that version, for example, a non-quoted string that has a single quote in the value.

And I agree: use the FileHelpers library when you can. However, that library requires you to know what your data will look like, and I need a generic parser.

So I've updated the code to the following and thought I'd share...

    static public List<string> ParseDelimitedString(string value, char delimiter)
    {
        bool inQuotes = false;
        bool inNonQuotes = false;
        bool secondQuote = false;
        char curQuote = '\0';

        List<string> results = new List<string>();

        StringBuilder sb = new StringBuilder();
        foreach (char c in value)
        {
            if (inNonQuotes)
            {
                // then quotes are just characters
                if (c == delimiter)
                {
                    results.Add(sb.ToString());
                    sb.Remove(0, sb.Length);
                    inNonQuotes = false;
                }
                else
                {
                    sb.Append(c);
                }
            }
            else if (inQuotes)
            {
                // then quotes need to be double escaped
                if ((c == '\'' && c == curQuote) || (c == '"' && c == curQuote))
                {
                    if (secondQuote)
                    {
                        secondQuote = false;
                        sb.Append(c);
                    }
                    else
                        secondQuote = true;
                }
                else if (secondQuote && c == delimiter)
                {
                    results.Add(sb.ToString());
                    sb.Remove(0, sb.Length);
                    inQuotes = false;
                }
                else if (!secondQuote)
                {
                    sb.Append(c);
                }
                else
                {
                    // bad,as,"user entered something like"this,poorly escaped,value
                    // just ignore until second delimiter found
                }
            }
            else
            {
                // not yet parsing a field
                if (c == '\'' || c == '"')
                {
                    curQuote = c;
                    inQuotes = true;
                    inNonQuotes = false;
                    secondQuote = false;
                }
                else if (c == delimiter)
                {
                    // blank field
                    inQuotes = false;
                    inNonQuotes = false;
                    results.Add(string.Empty);
                }
                else
                {
                    inQuotes = false;
                    inNonQuotes = true;
                    sb.Append(c);
                }
            }
        }

        if (inQuotes || inNonQuotes)
            results.Add(sb.ToString());

        return results;
    }

Answer by MrE

Since the question "Regex to to parse csv with nested quotes" points here and this one is much more generic, and since a regex is not really the proper way to solve this problem (I have had many issues with catastrophic backtracking, see http://www.regular-expressions.info/catastrophic.html), here is a simple parser implementation in Python as well:

def csv_to_array(string):
    stack = []
    match = []
    matches = []

    for c in string:
        # do we have a double quote?
        if c == "\"":
            # is it a closing match?
            if len(stack) > 0 and stack[-1] == c:
                stack.pop()
            else:
                stack.append(c)
        elif (c == "," and len(stack) == 0) or (c == "\n"):
            matches.append("".join(match))
            match = []
        else:
            match.append(c)

    # flush the last field, which has no trailing delimiter
    if match:
        matches.append("".join(match))

    return matches