C#: Detect the language of text
Notice: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/1464362/
Detect language of text
Asked by Nikhil
Is there any C# library which can detect the language of a particular piece of text? For example, for the input text "This is a sentence", it should detect the language as "English", and for "Esto es una sentencia" it should detect the language as "Spanish".
I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which makes a best guess at the input language. Is there something similar available publicly, preferably in C#?
Answered by Arafangion
You'll want a machine learning algorithm based on hidden Markov chains that processes a bunch of texts in different languages.
Then, when it gets to the unidentified text, the language with the closest 'score' is the winner.
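As a rough illustration of the "score" idea, here is a minimal Python sketch. It uses a plain (visible, not hidden) first-order character Markov chain with add-one smoothing, which is a simpler stand-in for the hidden Markov approach mentioned above. The class name, the function name, and the tiny training samples are all assumptions made for the example; the same logic ports directly to C#.

```python
import math
from collections import Counter

# Toy training samples (assumptions); real use needs far more text per language.
ENGLISH_SAMPLE = ("this is a simple sentence and the quick brown fox jumps over "
                  "the lazy dog while she is reading the newspaper in the garden")
SPANISH_SAMPLE = ("este texto es una frase sencilla y el rapido zorro marron salta "
                  "sobre el perro perezoso mientras ella lee el periodico en el jardin")


class CharMarkovModel:
    """First-order character Markov chain with add-one smoothing."""

    def __init__(self, training_text):
        text = f" {training_text.lower()} "
        # How often each character follows each other character.
        self.pair_counts = Counter(zip(text, text[1:]))
        # How often each character appears as the "previous" character.
        self.prev_counts = Counter(text[:-1])
        # Distinct characters seen, plus one bucket for unseen characters.
        self.vocab_size = len(set(text)) + 1

    def log_likelihood(self, text):
        """Average log-probability per character transition (higher is better)."""
        text = f" {text.lower()} "
        total = 0.0
        for prev, cur in zip(text, text[1:]):
            numerator = self.pair_counts[(prev, cur)] + 1
            denominator = self.prev_counts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total / (len(text) - 1)


def detect(text, models):
    """Return the language whose model gives the text the highest score."""
    return max(models, key=lambda lang: models[lang].log_likelihood(text))


if __name__ == "__main__":
    models = {
        "English": CharMarkovModel(ENGLISH_SAMPLE),
        "Spanish": CharMarkovModel(SPANISH_SAMPLE),
    }
    for sentence in ("This is a sentence", "Esto es una sentencia"):
        print(sentence, "->", detect(sentence, models))
```

With samples this small the scores are only indicative; the approach becomes useful once each model is trained on a large corpus.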
Answered by Vinko Vrsalovic
Here is a simple detector based on bigram statistics (which basically means learning from a big set of texts which bigrams occur more frequently in each language, then counting those in a piece of text and comparing the counts against the previously learned values):
http://allantech.blogspot.com/2007/07/automatic-language-detection.html
This is probably good enough for many (most?) applications and doesn't require Internet access.
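The bigram idea can be sketched in a few lines of Python. This is not the code from the linked post; it is a minimal illustration that builds a bigram count profile per language and picks the language whose profile has the highest cosine similarity with the input. The function names and the tiny training samples are assumptions for the example.

```python
import math
from collections import Counter

# Toy training samples (assumptions); a real detector learns from a big corpus.
ENGLISH_SAMPLE = ("this is a simple sentence and the quick brown fox jumps over "
                  "the lazy dog while she is reading the newspaper in the garden")
SPANISH_SAMPLE = ("este texto es una frase sencilla y el rapido zorro marron salta "
                  "sobre el perro perezoso mientras ella lee el periodico en el jardin")


def bigram_profile(text):
    """Count letter bigrams per word, using '_' to mark word boundaries."""
    letters = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    counts = Counter()
    for word in letters.split():
        padded = f"_{word}_"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two bigram count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_language(text, profiles):
    """Return the language whose learned profile best matches the text."""
    sample = bigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


if __name__ == "__main__":
    profiles = {
        "English": bigram_profile(ENGLISH_SAMPLE),
        "Spanish": bigram_profile(SPANISH_SAMPLE),
    }
    for sentence in ("This is a sentence", "Esto es una sentencia"):
        print(sentence, "->", detect_language(sentence, profiles))
```

Everything runs offline: the profiles are just dictionaries of counts, so they can be computed once and shipped with the application.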
Of course, it will perform worse than Google's or Bing's algorithms (which themselves aren't great). If you need excellent detection performance, you would have to do a lot of hard work over huge amounts of data.
The other option would be to leverage Google's or Bing's APIs if your app has Internet access.
Answered by dreamlax
Language detection is a pretty hard thing to do.
Some languages are much easier to detect than others simply due to the diacritics and digraphs/trigraphs used. For example, double acute accents are used almost exclusively in Hungarian. The dotless i 'ı' is used exclusively [I think] in Turkish, t-comma (not t-cedilla) is used only in Romanian, and the eszett 'ß' occurs only in German.
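A minimal Python sketch of this kind of character hint follows. The table contains only the characters named above; the function name and the idea of counting "votes" are assumptions for the example, and a hit is a hint rather than proof, since such characters can also appear in names or loanwords.

```python
import unicodedata
from collections import Counter

# Characters named in the answer, mapped to the language they hint at.
DISTINCTIVE_CHARS = {
    "\u0151": "Hungarian",  # o with double acute
    "\u0171": "Hungarian",  # u with double acute
    "\u0131": "Turkish",    # dotless i
    "\u021b": "Romanian",   # t with comma below (not t-cedilla, U+0163)
    "\u00df": "German",     # eszett
}


def character_hints(text):
    """Return (language, hit count) pairs, most frequent first."""
    # NFC composes sequences such as 'o' + combining double acute into one
    # code point, so decomposed input still matches the table.
    normalized = unicodedata.normalize("NFC", text).lower()
    votes = Counter(
        DISTINCTIVE_CHARS[ch] for ch in normalized if ch in DISTINCTIVE_CHARS
    )
    return votes.most_common()


if __name__ == "__main__":
    for sample in ("Stra\u00dfe", "gy\u0171r\u0171", "This is a sentence"):
        print(sample, "->", character_hints(sample))
```

An empty result only means that none of the listed characters occurred, so this check works best as one signal among several.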
Some digraphs, trigraphs and tetragraphs are also a good give-away. For example, you'll most likely find ‘eeuw’ and ‘ieuw’ primarily in Dutch, and ‘tsch’ and ‘dsch’ primarily in German etc.
More giveaways would include common words or common prefixes/suffixes used in a particular language. Sometimes even the punctuation that is used can help determine a language (quote-style and use, etc).
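The letter-sequence and common-word giveaways can be combined into a simple scoring function. The Python sketch below is only an illustration: the letter sequences come from the text above, while the common-word lists, the inverted Spanish punctuation marks, and the weights are assumptions chosen for the example.

```python
import re
from collections import Counter

# Letter sequences mentioned above.
LETTER_SEQUENCES = {"eeuw": "Dutch", "ieuw": "Dutch", "tsch": "German", "dsch": "German"}

# Tiny illustrative word lists (assumptions); real lists would be much longer.
COMMON_WORDS = {
    "English": {"the", "and", "is", "of"},
    "Spanish": {"el", "la", "es", "una"},
    "Dutch": {"het", "een", "van", "en"},
    "German": {"der", "die", "und", "ist"},
}

# Punctuation hints (assumption): inverted marks are typical of Spanish.
PUNCTUATION = {"\u00bf": "Spanish", "\u00a1": "Spanish"}


def score_hints(text):
    """Return a Counter of hint scores per language (zero scores dropped)."""
    lowered = text.lower()
    scores = Counter()
    for sequence, lang in LETTER_SEQUENCES.items():
        scores[lang] += 2 * lowered.count(sequence)  # weight 2 is arbitrary
    for word in re.findall(r"[^\W\d_]+", lowered):   # runs of letters only
        for lang, words in COMMON_WORDS.items():
            if word in words:
                scores[lang] += 1
    for mark, lang in PUNCTUATION.items():
        scores[lang] += lowered.count(mark)
    return +scores  # unary plus removes entries whose count is zero


if __name__ == "__main__":
    for sample in ("This is a sentence", "Esto es una sentencia", "Een nieuwe eeuw"):
        print(sample, "->", score_hints(sample).most_common(1))
```

Hand-written tables like these are cheap to extend, but they only cover the languages someone has written rules for.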
If such a library exists I would like to know about it, since I'm working on one myself.
Answered by Laurynas
There is a simple tool to identify text language: http://www.detectlanguage.com/
Answered by Matt Gibson
I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on this original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work (the letter combinations that are relevant to a particular language) is all in there as data.
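TextCat-style classifiers are usually described as ranking the most frequent character n-grams of each language and comparing rank positions (the "out-of-place" measure from Cavnar and Trenkle's n-gram text categorization work). The Python sketch below illustrates that idea only; the function names, the n-gram length limit, the profile size, and the toy samples are assumptions, not TextCat's actual code or data.

```python
from collections import Counter

# Toy training samples (assumptions); TextCat ships real per-language profiles.
ENGLISH_SAMPLE = ("this is a simple sentence and the quick brown fox jumps over "
                  "the lazy dog while she is reading the newspaper in the garden")
SPANISH_SAMPLE = ("este texto es una frase sencilla y el rapido zorro marron salta "
                  "sobre el perro perezoso mientras ella lee el periodico en el jardin")

PROFILE_SIZE = 300  # keep this many top n-grams; also the penalty for a miss


def ngram_ranks(text, max_n=3, top=PROFILE_SIZE):
    """Map each of the most frequent n-grams (length 1..max_n) to its rank."""
    letters = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    counts = Counter()
    for word in letters.split():
        padded = f"_{word}_"  # '_' marks word boundaries
        for n in range(1, max_n + 1):
            counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Most frequent first; ties broken alphabetically to stay deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(sample_ranks, profile_ranks, penalty=PROFILE_SIZE):
    """Sum of rank differences; n-grams missing from the profile cost `penalty`."""
    return sum(
        abs(rank - profile_ranks[gram]) if gram in profile_ranks else penalty
        for gram, rank in sample_ranks.items()
    )


def classify(text, profiles):
    """Return the language whose profile is the smallest distance from the text."""
    sample = ngram_ranks(text)
    return min(profiles, key=lambda lang: out_of_place(sample, profiles[lang]))


if __name__ == "__main__":
    profiles = {
        "English": ngram_ranks(ENGLISH_SAMPLE),
        "Spanish": ngram_ranks(SPANISH_SAMPLE),
    }
    for sentence in ("This is a sentence", "Esto es una sentencia"):
        print(sentence, "->", classify(sentence, profiles))
```

Because a profile is just a ranked list of n-grams, it can be stored as plain data, which matches the observation that the hard work lives in the data rather than in the code.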
Answered by Ivan Akcheurov
Yes indeed, TextCat is very good for language identification, and it has a lot of implementations in different languages.
There were no ports to .NET, so I have written one: NTextCat (NuGet, Online Demo).
It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.
Any feedback is much appreciated! New ideas and feature requests are welcome too :)
Answered by Sasvári Tamás
Please find a C# implementation based on 3-gram analysis here: