Regex: how to get words from a string (C#)
Original URL: http://stackoverflow.com/questions/2159026/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by Led
My input consists of user-posted strings.
What I want to do is create a dictionary with words, and how often they've been used. This means I want to parse a string, remove all garbage, and get a list of words as output.
For example, say the input is
"#@!@LOLOLOL YOU'VE BEEN \***PWN3D*** ! :') !!!1einszwei drei !"
The output I need is the list:
"LOLOLOL"
"YOU'VE"
"BEEN"
"PWN3D"
"einszwei"
"drei"
I'm no hero at regular expressions and have been Googling, but my Google-kungfu seems to be weak …
How would I go from input to the wanted output?
Accepted answer by John Gietzen
Simple Regex:
\w+
This matches a string of "word" characters. That is almost what you want.
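To see why \w+ is only "almost" right, here is a minimal sketch that runs it over a lightly simplified version of the sample input. It assumes C# 9 or later (top-level statements); the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input adapted from the question.
string input = "#@!@LOLOLOL YOU'VE BEEN *PWN3D* ! :') !!!1einszwei drei !";

// \w+ grabs every run of letters, digits and underscores.
string[] simpleWords = Regex.Matches(input, @"\w+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", simpleWords));
// prints: LOLOLOL, YOU, VE, BEEN, PWN3D, 1einszwei, drei
```

The apostrophe splits YOU'VE into two tokens, and the stray 1 stays glued to einszwei, which is what the refined patterns address.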
This is slightly more accurate:
\w(?<!\d)[\w'-]*
It matches a run of word characters, apostrophes, and hyphens, ensuring that the first character is not a digit.
Here are my matches:
1 LOLOLOL
2 YOU'VE
3 BEEN
4 PWN3D
5 einszwei
6 drei
Now, that's more like it.
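Since the goal in the question is a dictionary of words and how often they occur, the pattern can be plugged into a small counting loop. This is only a sketch: CountWords is a hypothetical helper name, and treating "drei" and "Drei" as the same word is an assumption the question does not state. It assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: counts how often each word occurs.
// Assumption: case is ignored, so "drei" and "Drei" share one entry.
static Dictionary<string, int> CountWords(string text)
{
    var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    foreach (Match m in Regex.Matches(text, @"\w(?<!\d)[\w'-]*"))
    {
        counts.TryGetValue(m.Value, out int seen); // seen is 0 for a new word
        counts[m.Value] = seen + 1;
    }
    return counts;
}

var wordCounts = CountWords("#@!@LOLOLOL YOU'VE BEEN *PWN3D* ! :') !!!1einszwei drei Drei !");
foreach (var pair in wordCounts)
    Console.WriteLine($"{pair.Key}: {pair.Value}");
```

Swapping in StringComparer.Ordinal would keep differently cased spellings apart, if that is what you need.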
EDIT:
The reason for the negative look-behind is that some regex flavors support Unicode characters. Using [a-zA-Z] would miss quite a few desirable "word" characters. Allowing \w and disallowing \d includes all Unicode characters that could conceivably start a word in any block of text.
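The difference is easy to see on text with non-ASCII letters. The sketch below relies on .NET's default behavior, where \w is Unicode-aware unless RegexOptions.ECMAScript is set; Find is a hypothetical helper and the sample string is made up. It assumes C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Escapes keep the source ASCII-only; the text reads "straße über 42nd".
string text = "stra\u00DFe \u00FCber 42nd";

// Hypothetical helper: returns all matches of a pattern as strings.
static string[] Find(string s, string pattern) =>
    Regex.Matches(s, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string[] asciiOnly = Find(text, "[a-zA-Z]+");             // stra, e, ber, nd
string[] unicodeAware = Find(text, @"\w(?<!\d)[\w'-]*");  // straße, über, nd

Console.WriteLine(string.Join(", ", asciiOnly));
Console.WriteLine(string.Join(", ", unicodeAware));
```

The ASCII-only class breaks both words at the non-ASCII letter, while the \w-based pattern keeps them whole and still refuses to start a word on a digit.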
EDIT 2:
I have found a more concise way to get the effect of the negative lookbehind: Double negative character class with a single negative exclusion.
[^\W\d][\w'-]*(?<=\w)
This is the same as the above with the exception that it also ensures that the word ends with a word character. And, finally, there is:
[^\W\d](\w|[-']{1,2}(?=\w))*
This ensures that there are no more than two non-word characters in a row. That is, it matches "word-up" and "word--up", but not "word---up", which it splits into "word" and "up". If you want it to match "word---up" as well, you can change the 2 to a 3.
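A quick way to check how the {1,2} quantifier behaves is to run the final pattern over a few hyphenated samples. This is a small sketch with made-up sample strings; Words is a hypothetical helper, and C# 9 or later is assumed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var strict = new Regex(@"[^\W\d](\w|[-']{1,2}(?=\w))*");

// Hypothetical helper: returns all matches of a regex as strings.
static string[] Words(Regex regex, string s) =>
    regex.Matches(s).Cast<Match>().Select(m => m.Value).ToArray();

foreach (string sample in new[] { "word-up", "word--up", "word---up", "rock'n'roll-" })
    Console.WriteLine($"{sample} -> {string.Join(" | ", Words(strict, sample))}");
// word-up -> word-up
// word--up -> word--up
// word---up -> word | up
// rock'n'roll- -> rock'n'roll
```

The lookahead (?=\w) is what drops the trailing hyphen: a run of separators is only accepted when a word character follows it.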
Answer by Mike Atlas
You should look into Natural Language Processing (NLP), not regular expressions, and if you are targeting more than one spoken language, you need to factor that in as well. Since you're using C#, check out the SharpNLP project.
Edit: This approach is only necessary if you care about the semantic content of the words you're trying to split up.
Answer by Jason
You don't necessarily need a regex for this, if tokenizing is all you're doing. First you could sanitize the string by removing all non-letter characters except for spaces and then do a Split() on the space character. That will work for most everything, although contractions may be tough. That should get you started at least.
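The sanitize-then-split idea could look roughly like this. It is a sketch, not a tuned implementation: SplitWords is a hypothetical name, and "letter" is taken to mean char.IsLetter. It assumes C# 9 or later.

```csharp
using System;
using System.Linq;

// Hypothetical helper: drop everything except letters and spaces, then split.
static string[] SplitWords(string text)
{
    var cleaned = new string(text.Where(c => char.IsLetter(c) || c == ' ').ToArray());
    // RemoveEmptyEntries discards the gaps left where junk tokens used to be.
    return cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}

string[] tokens = SplitWords("#@!@LOLOLOL YOU'VE BEEN *PWN3D* ! :') !!!1einszwei drei !");
Console.WriteLine(string.Join(", ", tokens));
// prints: LOLOLOL, YOUVE, BEEN, PWND, einszwei, drei
```

The output shows the weak spots: the contraction loses its apostrophe (YOUVE) and PWN3D loses its digit (PWND).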
Answer by JSmyth
My gut feeling would not be to use regular expressions, but just do a loop or two.
Iterate over each char in the string; if it is not a valid char, replace it with a space. Then use String.Split() and split on spaces.
Apostrophes and hyphens may be a little trickier: it is harder to determine whether they are junk characters or legitimate ones. But if you are using a for loop to iterate over the string, then looking backwards and forwards from the current character should help you.
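One way to read the loop idea is sketched below. It makes two assumptions: a "valid char" is a letter or digit, and an apostrophe or hyphen is legitimate only when it has a letter or digit on both sides. LoopWords is a hypothetical name; C# 9 or later is assumed.

```csharp
using System;
using System.Text;

// Hypothetical helper: replaces junk characters with spaces, then splits.
static string[] LoopWords(string text)
{
    var buffer = new StringBuilder(text.Length);
    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];
        // Look backwards and forwards: ' and - only count inside a word.
        bool joiner = (c == '\'' || c == '-')
            && i > 0 && char.IsLetterOrDigit(text[i - 1])
            && i + 1 < text.Length && char.IsLetterOrDigit(text[i + 1]);
        buffer.Append((char.IsLetterOrDigit(c) || joiner) ? c : ' ');
    }
    return buffer.ToString().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}

string[] looped = LoopWords("#@!@LOLOLOL YOU'VE BEEN *PWN3D* ! :') !!!1einszwei drei foo--bar!");
Console.WriteLine(string.Join(", ", looped));
// prints: LOLOLOL, YOU'VE, BEEN, PWN3D, 1einszwei, drei, foo, bar
```

Unlike the regex answers, this version keeps the leading digit in 1einszwei; stripping it would need one more check.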
Then you will have a list of words. For each of these words, check whether it is valid in your dictionary. If you want this to be fast, performing some kind of binary search would be best. But just to get it working, a linear search would be easier to start with.
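For the dictionary check, a binary search over a sorted array is available out of the box via Array.BinarySearch. The word list below is made up for illustration, IsKnown is a hypothetical helper, and the case-insensitive comparer is an assumption. C# 9 or later is assumed.

```csharp
using System;

// Hypothetical word list; Array.BinarySearch needs it sorted with the same comparer.
string[] knownWords = { "you've", "been", "drei" };
Array.Sort(knownWords, StringComparer.OrdinalIgnoreCase);

// Hypothetical helper: true when the word is in the sorted list.
static bool IsKnown(string[] sortedWords, string word) =>
    Array.BinarySearch(sortedWords, word, StringComparer.OrdinalIgnoreCase) >= 0;

Console.WriteLine(IsKnown(knownWords, "BEEN"));     // True
Console.WriteLine(IsKnown(knownWords, "asdfasdf")); // False
```

A HashSet<string> would be a common alternative if the list does not need to stay sorted.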
EDIT: I only mentioned the dictionary thing because I thought you might be interested only in legitimate words, i.e. not "asdfasdf", but ignore that last statement if that's not what you need.
Answer by Greg Bacon
Using the following
var pattern = new Regex(
    @"( [^\W_\d]              # starting with a letter
                              # followed by a run of either...
        ( [^\W_\d] |          #   more letters or
          [-'\d](?=[^\W_\d])  #   ', -, or digit followed by a letter
        )*
        [^\W_\d]              # and finishing with a letter
      )",
    RegexOptions.IgnorePatternWhitespace);

var input = "#@!@LOLOLOL YOU'VE BEEN *PWN3D* ! :') !!!1einszwei drei foo--bar!";

foreach (Match m in pattern.Matches(input))
    Console.WriteLine("[{0}]", m.Groups[1].Value);
produces output of
[LOLOLOL]
[YOU'VE]
[BEEN]
[PWN3D]
[einszwei]
[drei]
[foo]
[bar]
Answer by user8846868
I wrote an extension for String like this:
// Requires: using System.Collections.Generic; using System.Linq;
private static string[] GetWords(string text)
{
    List<string> lstreturn = new List<string>();
    List<string> lst = text.Split(new[] { ' ' }).ToList();
    foreach (string str in lst)
    {
        // Skip the empty entries produced by consecutive spaces.
        if (str.Trim() != "")
        {
            lstreturn.Add(str);
        }
    }
    return lstreturn.ToArray();
}