C# Tokenizer - keeping the separators

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/1134311/

Time: 2020-08-06 09:02:31  Source: igfitidea

c#stringtokenizer

Asked by Ipster

I am porting code from Java to C#, and part of the Java code uses a tokenizer. It is my understanding that the array produced by Java's StringTokenizer also includes the separators (in this case +, -, /, *, (, )) as tokens. I tried the C# Split() function, but it appears to discard the separators themselves. In the end, this will parse a string and run it as a calculation. I have done a lot of research and have not found any references on the topic.

Does anyone know how to get the actual separators, in the order they were encountered, to be in the split array?

Code for tokenizing:

public CalcLexer(String s)
{
    char[] seps = {'\t','\n','\r','+','-','*','/','(',')'};
    tokens = s.Split(seps);
    advance();
}

Testing:

static void Main(string[] args)
{
    CalcLexer myCalc = new CalcLexer("24+3");
    Console.ReadLine();
}

The input "24+3" results in the following output: "24", "3". I am looking for an output of "24", "+", "3".

In the interest of full disclosure, this project is part of a class assignment, and uses the following complete source code:

http://www.webber-labs.com/mpl/source%20code/Chapter%20Seventeen/CalcParser.java.txt
http://www.webber-labs.com/mpl/source%20code/Chapter%20Seventeen/CalcLexer.java.txt

Accepted answer by Pavel Minaev

You can use Regex.Split with zero-width assertions. For example, the following will split on any of +, -, * and /:

Regex.Split(str, @"(?=[-+*/])|(?<=[-+*/])");

Effectively this says, "split at this point if it is followed by, or preceded by, any of -+*/". The matched string itself will be zero-length, so you won't lose any part of the input string.
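For reference, here is a minimal, self-contained sketch of the lookaround approach. The class name LookaroundTokenizer and the Tokenize helper are hypothetical, and adding ( and ) to the character class is an assumption based on the separators listed in the question. Whitespace is not treated as a separator here, so it would stay inside the tokens.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is illustrative only.
public static class LookaroundTokenizer
{
    // Zero-width split points: just before or just after a separator.
    // Parentheses are added here as an assumption, based on the question.
    private const string Pattern = @"(?=[-+*/()])|(?<=[-+*/()])";

    public static string[] Tokenize(string input)
    {
        return Regex.Split(input, Pattern)
                    // Drop empty strings, which appear when the input
                    // starts with a separator such as "(".
                    .Where(token => token.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Tokenize("24+3")));    // prints: 24 + 3
        Console.WriteLine(string.Join(" ", Tokenize("(2+3)*4"))); // prints: ( 2 + 3 ) * 4
    }
}
```

The empty-string filter is optional for an input like "24+3", but without it an expression that begins with a separator would yield a leading empty token.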

Answer by Sam Harwell

If you want a very flexible, powerful, reliable, and expandable solution, you can use the C# port of ANTLR. There is some initial overhead (link is setup information for VS2008) that would likely make it overkill for such a tiny project. Here's a calculator example with support for variables.

Probably overkill for your class, but if you're interested in learning about "real" solutions to this type of real-world problem, have a look-see. I even have a Visual Studio package for working with the grammars, or you can use ANTLRWorks separately.

Answer by Shane Castle

This produces your output:

string s = "24+3";
string seps = @"(\t)|(\n)|(\+)|(-)|(\*)|(/)|(\()|(\))";
string[] tokens = System.Text.RegularExpressions.Regex.Split(s, seps);

foreach (string token in tokens)
    Console.WriteLine(token);
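
This works because Regex.Split includes text matched by capturing groups in the returned array. As a variation (a sketch, not part of the original answer), the separators can go in a single capturing group, with whitespace matched outside the group so it is consumed but not returned. The class name CaptureGroupTokenizer and the Tokenize helper are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is illustrative only.
public static class CaptureGroupTokenizer
{
    // Operators and parentheses sit inside the capturing group, so
    // Regex.Split returns them. Whitespace is matched outside the
    // group, so it splits the input but is not returned as a token.
    private const string Pattern = @"([-+*/()])|\s+";

    public static string[] Tokenize(string input)
    {
        return Regex.Split(input, Pattern)
                    // Adjacent separators leave empty strings between them.
                    .Where(token => token.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        foreach (string token in Tokenize("24 + 3"))
            Console.WriteLine(token);
    }
}
```

Filtering out empty strings matters for inputs such as "(2+3)*4", where two separators sit next to each other or the input starts with one.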