C#: Most efficient way to remove special characters from a string

Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/1120198/

Date: 2020-08-06 08:40:30  Source: igfitidea

Most efficient way to remove special characters from string

c# string

Asked by ObiWanKenobi

I want to remove all special characters from a string. Allowed characters are A-Z (uppercase or lowercase), numbers (0-9), underscore (_), or the dot sign (.).

I have the following, it works but I suspect (I know!) it's not very efficient:

    public static string RemoveSpecialCharacters(string str)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.Length; i++)
        {
            if ((str[i] >= '0' && str[i] <= '9')
                || (str[i] >= 'A' && str[i] <= 'z'
                    || (str[i] == '.' || str[i] == '_')))
                {
                    sb.Append(str[i]);
                }
        }

        return sb.ToString();
    }

What is the most efficient way to do this? What would a regular expression look like, and how does it compare with normal string manipulation?

The strings that will be cleaned will be rather short, usually between 10 and 30 characters in length.

Accepted answer by Guffa

Why do you think that your method is not efficient? It's actually one of the most efficient ways that you can do it.

You should of course read the character into a local variable or use an enumerator to reduce the number of array accesses:

public static string RemoveSpecialCharacters(this string str) {
   StringBuilder sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

One thing that makes a method like this efficient is that it scales well. The execution time is proportional to the length of the string, so there are no nasty surprises if you use it on a large string.

Edit:
I made a quick performance test, running each function a million times with a 24 character string. These are the results:

Original function: 54.5 ms.
My suggested change: 47.1 ms.
Mine with setting StringBuilder capacity: 43.3 ms.
Regular expression: 294.4 ms.
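
A timing loop of this kind could be sketched roughly as below. This is only an illustration and not the harness that produced the numbers above: the class name `BenchmarkSketch`, the helper `TimeMs`, and the choice of which variant to time are all assumptions, and results will differ by machine, runtime, and build configuration.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class BenchmarkSketch
{
    // The foreach/StringBuilder approach with the capacity preset to the input length.
    public static string RemoveSpecialCharacters(string str)
    {
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Hypothetical helper: calls `clean` on `input` `iterations` times
    // and returns the elapsed wall-clock time in milliseconds.
    public static double TimeMs(Func<string, string> clean, string input, int iterations)
    {
        clean(input); // warm-up call so JIT compilation is not included in the timing
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            clean(input);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

It could be called as `BenchmarkSketch.TimeMs(BenchmarkSketch.RemoveSpecialCharacters, "some!short$input_string.txt", 1000000)`, ideally in a release build and repeated a few times to smooth out noise.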

Edit 2: I added the distinction between A-Z and a-z in the code above. (I reran the performance test, and there is no noticeable difference.)

Edit 3:
I tested the lookup+char[] solution, and it runs in about 13 ms.

The price to pay is, of course, the initialization of the huge lookup table and keeping it in memory. Well, it's not that much data, but it's a lot for such a trivial function...

private static bool[] _lookup;

static Program() {
   _lookup = new bool[65536];
   for (char c = '0'; c <= '9'; c++) _lookup[c] = true;
   for (char c = 'A'; c <= 'Z'; c++) _lookup[c] = true;
   for (char c = 'a'; c <= 'z'; c++) _lookup[c] = true;
   _lookup['.'] = true;
   _lookup['_'] = true;
}

public static string RemoveSpecialCharacters(string str) {
   char[] buffer = new char[str.Length];
   int index = 0;
   foreach (char c in str) {
      if (_lookup[c]) {
         buffer[index] = c;
         index++;
      }
   }
   return new string(buffer, 0, index);
}

Answer by Stephen Wrighton

I would use a String Replace with a Regular Expression searching for "special characters", replacing all characters found with an empty string.

Answer by Steven Sudit

I suggest creating a simple lookup table, which you can initialize in the static constructor to set any combination of characters to valid. This lets you do a quick, single check.

edit

Also, for speed, you'll want to initialize the capacity of your StringBuilder to the length of your input string. This will avoid reallocations. These two methods together will give you both speed and flexibility.
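
Combining the two suggestions might look something like the sketch below. The class name `LookupCleaner` and the field name `Allowed` are made up for illustration, and the 65536-entry table (one flag per UTF-16 code unit) is just one possible sizing.

```csharp
using System.Text;

public static class LookupCleaner
{
    // One flag per possible char value; true means "keep this character".
    private static readonly bool[] Allowed = new bool[65536];

    // Static constructor: mark the valid characters once, before first use.
    static LookupCleaner()
    {
        for (char c = '0'; c <= '9'; c++) Allowed[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) Allowed[c] = true;
        for (char c = 'a'; c <= 'z'; c++) Allowed[c] = true;
        Allowed['.'] = true;
        Allowed['_'] = true;
    }

    public static string RemoveSpecialCharacters(string str)
    {
        // The result can never be longer than the input, so this capacity avoids reallocations.
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (Allowed[c]) // single array lookup instead of several range comparisons
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Changing the allowed set then only means editing the static constructor; the cleaning loop stays the same.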

another edit

I think the compiler might optimize it out, but as a matter of style as well as efficiency, I recommend foreach instead of for.

Answer by Blixt

Well, unless you really need to squeeze the performance out of your function, just go with what is easiest to maintain and understand. A regular expression is one such option.

For additional performance, you can either pre-compile it or just tell it to compile on first call (subsequent calls will be faster), as in this version:

public static string RemoveSpecialCharacters(string str)
{
    return Regex.Replace(str, "[^a-zA-Z0-9_.]+", "", RegexOptions.Compiled);
}
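
One way to read "pre-compile it" is to construct the Regex once and keep it in a static field, so the pattern is not handed over on every call. The sketch below assumes that reading; the names `RegexCleaner` and `Disallowed` are illustrative only.

```csharp
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // Constructed once when the class is first used and then reused for every call.
    private static readonly Regex Disallowed =
        new Regex("[^a-zA-Z0-9_.]+", RegexOptions.Compiled);

    public static string RemoveSpecialCharacters(string str)
    {
        // Replace every run of disallowed characters with an empty string.
        return Disallowed.Replace(str, "");
    }
}
```

Whether this pays off for 10 to 30 character inputs is worth measuring rather than assuming.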

Answer by CMS

A regular expression will look like:

public string RemoveSpecialChars(string input)
{
    return Regex.Replace(input, @"[^0-9a-zA-Z\._]", string.Empty);
}

But if performance is highly important, I recommend that you run some benchmarks before selecting the "regex path"...

Answer by bruno conde

It seems good to me. The only improvement I would make is to initialize the StringBuilder with the length of the string.

StringBuilder sb = new StringBuilder(str.Length);

Answer by lc.

I'm not convinced your algorithm is anything but efficient. It's O(n) and only looks at each character once. You're not gonna get any better than that unless you magically know values before checking them.

I would however initialize the capacity of your StringBuilder to the initial size of the string. I'm guessing your perceived performance problem comes from memory reallocation.

Side note: Checking A-z is not safe. You're including [, \, ], ^, _, and `...

Side note 2: For that extra bit of efficiency, put the comparisons in an order to minimize the number of comparisons. (At worst, you're talking 8 comparisons tho, so don't think too hard.) This changes with your expected input, but one example could be:

if (str[i] >= '0' && str[i] <= 'z' && 
    (str[i] >= 'a' || str[i] <= '9' ||  (str[i] >= 'A' && str[i] <= 'Z') || 
    str[i] == '_') || str[i] == '.')

Side note 3: If for whatever reason you REALLY need this to be fast, a switch statement may be faster. The compiler should create a jump table for you, resulting in only a single comparison:

switch (str[i])
{
    case '0':
    case '1':
    .
    .
    .
    case '.':
        sb.Append(str[i]);
        break;
}

Answer by Christian Klauser

I wonder if a Regex-based replacement (possibly compiled) is faster. I would have to test that, although someone has found this to be ~5 times slower.

Other than that, you should initialize the StringBuilder with an expected length, so that the intermediate string doesn't have to be copied around while it grows.

A good number is the length of the original string, or something slightly lower (depending on the nature of the function's inputs).

Finally, you can use a lookup table (in the range 0..127) to find out whether a character is to be accepted.
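
A table limited to 0..127 needs a range check before the lookup, since a C# char can hold values well above 127. A minimal sketch of that idea follows; the names `AsciiLookupCleaner` and `Allowed` are assumptions for illustration.

```csharp
public static class AsciiLookupCleaner
{
    // Covers only the ASCII range 0..127; everything else is treated as invalid.
    private static readonly bool[] Allowed = new bool[128];

    static AsciiLookupCleaner()
    {
        for (char c = '0'; c <= '9'; c++) Allowed[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) Allowed[c] = true;
        for (char c = 'a'; c <= 'z'; c++) Allowed[c] = true;
        Allowed['.'] = true;
        Allowed['_'] = true;
    }

    public static string RemoveSpecialCharacters(string str)
    {
        char[] buffer = new char[str.Length];
        int index = 0;
        foreach (char c in str)
        {
            // The bounds check rejects any character outside the table before indexing into it.
            if (c < Allowed.Length && Allowed[c])
            {
                buffer[index++] = c;
            }
        }
        return new string(buffer, 0, index);
    }
}
```

The trade-off against a full 65536-entry table is one extra comparison per character in exchange for a table of only 128 entries.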

Answer by Triynko

If you're worried about speed, use pointers to edit the existing string. You could pin the string and get a pointer to it, then run a for loop over each character, overwriting each invalid character with a replacement character. It would be extremely efficient and would not require allocating any new string memory. You would also need to compile your module with the unsafe option, and add the "unsafe" modifier to your method header in order to use pointers.

static void Main(string[] args)
{
    string str = "string!$%with^&*invalid!!characters";
    Console.WriteLine( str ); //print original string
    FixMyString( str, ' ' );
    Console.WriteLine( str ); //print string again to verify that it has been modified
    Console.ReadLine(); //pause to leave command prompt open
}


public static unsafe void FixMyString( string str, char replacement_char )
{
    fixed (char* p_str = str)
    {
        char* c = p_str; //temp pointer, since p_str is read-only
        for (int i = 0; i < str.Length; i++, c++) //loop through each character in string, advancing the character pointer as well
            if (!IsValidChar(*c)) //check whether the current character is invalid
                (*c) = replacement_char; //overwrite character in existing string with replacement character
    }
}

public static bool IsValidChar( char c )
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c == '.' || c == '_');
    //return char.IsLetterOrDigit( c ) || c == '.' || c == '_'; //this may work as well
}

Answer by LukeH

public static string RemoveSpecialCharacters(string str)
{
    char[] buffer = new char[str.Length];
    int idx = 0;

    foreach (char c in str)
    {
        if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z') || (c == '.') || (c == '_'))
        {
            buffer[idx] = c;
            idx++;
        }
    }

    return new string(buffer, 0, idx);
}