C#: How to detect the language of a string?

Note: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/1192768/

Date: 2020-08-06 10:29:52  Source: igfitidea

How to detect the language of a string?

c# language-detection

Asked by Alon Gubkin

What's the best way to detect the language of a string?

Accepted answer by Magnus Johansson

If the context of your code has internet access, you can try the Google API for language detection: http://code.google.com/apis/ajaxlanguage/documentation/

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    var language = 'unknown';
    // Map the detected language code back to its readable name.
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language + "";
  }
});

And, since you are using C#, take a look at this article on how to call the API from C#.

UPDATE: That C# link is gone; here's a cached copy of the core of it:

string s = TextBoxTranslateEnglishToHebrew.Text;
string key = "YOUR GOOGLE AJAX API KEY";
GoogleLangaugeDetector detector =
   new GoogleLangaugeDetector(s, VERSION.ONE_POINT_ZERO, key);

GoogleTranslator gTranslator = new GoogleTranslator(s, VERSION.ONE_POINT_ZERO,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.HEBREW : LANGUAGE.ENGLISH,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.ENGLISH : LANGUAGE.HEBREW,
   key);

TextBoxTranslation.Text = gTranslator.Translation;


Basically, you need to create a URI like the following and send it to Google:

http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=hello%20world&langpair=en%7ciw&key=your_google_api_key_goes_here

This tells the API that you want to translate "hello world" from English to Hebrew. Google's JSON response would look like:

{"responseData": {"translatedText":"???? ?????"}, "responseDetails": null, "responseStatus": 200}

I chose to make a base class that represents a typical Google JSON response:

[Serializable]
public class JSONResponse
{
   public string responseDetails = null;
   public string responseStatus = null;
}

Then, a Translation object that inherits from this class:

[Serializable]
public class Translation: JSONResponse
{
   public TranslationResponseData responseData = 
    new TranslationResponseData();
}

This Translation class has a TranslationResponseData object that looks like this:

[Serializable]
public class TranslationResponseData
{
   public string translatedText;
}

Finally, we can make the GoogleTranslator class:

using System;
using System.Collections.Generic;
using System.Text;

using System.Web;
using System.Net;
using System.IO;
using System.Runtime.Serialization.Json;

namespace GoogleTranslationAPI
{

   public class GoogleTranslator
   {
      private string _q = "";
      private string _v = "";
      private string _key = "";
      private string _langPair = "";
      private string _requestUrl = "";
      private string _translation = "";

      public GoogleTranslator(string queryTerm, VERSION version, LANGUAGE languageFrom,
         LANGUAGE languageTo, string key)
      {
         // UrlEncode (not UrlPathEncode) so that '&', '=' and '?' in the text are escaped too.
         _q = HttpUtility.UrlEncode(queryTerm);
         _v = HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(version));
         _langPair =
            HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(languageFrom) +
            "|" + EnumStringUtil.GetStringValue(languageTo));
         _key = HttpUtility.UrlEncode(key);

         string encodedRequestUrlFragment =
            string.Format("?v={0}&q={1}&langpair={2}&key={3}",
            _v, _q, _langPair, _key);

         _requestUrl = EnumStringUtil.GetStringValue(BASEURL.TRANSLATE) + encodedRequestUrlFragment;

         GetTranslation();
      }

      public string Translation
      {
         get { return _translation; }
         private set { _translation = value; }
      }

      private void GetTranslation()
      {
         try
         {
            WebRequest request = WebRequest.Create(_requestUrl);

            // Dispose the response and reader when done, and read the whole
            // body rather than only its first line.
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
               string json = reader.ReadToEnd();
               using (MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(json)))
               {
                  DataContractJsonSerializer ser =
                     new DataContractJsonSerializer(typeof(Translation));
                  Translation translation = ser.ReadObject(ms) as Translation;

                  _translation = translation.responseData.translatedText;
               }
            }
         }
         // Best effort: on any failure the Translation property stays empty.
         catch (Exception) { }
      }
   }
}

Answer by GvS

Do a statistical analysis of the string: split it into words, get a dictionary for every language you want to test for, and then pick the language whose dictionary matches the most words.
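
A minimal sketch of that word-count idea, under loud assumptions: the class name DictionaryLanguageGuesser and the tiny word lists are made up for illustration, and real dictionaries would need thousands of entries per language.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DictionaryLanguageGuesser
{
    // Tiny illustrative word lists (assumption): real dictionaries are far larger.
    private static readonly Dictionary<string, HashSet<string>> Dictionaries =
        new Dictionary<string, HashSet<string>>
        {
            ["English"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
                { "the", "is", "and", "where", "bathroom", "of", "to" },
            ["Spanish"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
                { "el", "la", "es", "y", "dónde", "está", "baño", "de" },
            ["Dutch"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
                { "de", "het", "is", "en", "waar", "een", "van" },
        };

    // Returns the language whose dictionary contains the most words of the text,
    // or null when no word matches any dictionary.
    public static string Guess(string text)
    {
        // Replace every non-letter with a space, then split into words.
        string lettersOnly = new string(text.Select(c => char.IsLetter(c) ? c : ' ').ToArray());
        string[] words = lettersOnly.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var best = Dictionaries
            .Select(d => new { Language = d.Key, Hits = words.Count(w => d.Value.Contains(w)) })
            .OrderByDescending(x => x.Hits)
            .First();

        return best.Hits > 0 ? best.Language : null;
    }
}
```

Calling DictionaryLanguageGuesser.Guess(text) returns a language name or null. Words shared between languages (such as "is" or "de" above) count for every dictionary that contains them, which is one reason short inputs are unreliable with this approach.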

In C#, every string in memory is Unicode (UTF-16), so there is no per-string encoding to inspect. Text files do not store their encoding either (at most there is an indication of whether it is 8-bit or 16-bit).

If you only need to distinguish between two languages, some simple tricks may do. For example, to tell English from Dutch, a string that contains the letter "y" is most likely English (unreliable, but fast).

Answer by AakashM

If you mean the natural (i.e. human) language, this is in general a Hard Problem. What language is "server" - English or Turkish? What language is "chat" - English or French? What language is "uno" - Italian or Spanish (or Latin!)?

Without paying attention to context, and doing some hard natural language processing (<----- this is the phrase to google for), you haven't got a chance.

You might enjoy a look at Frengly - it's a nice UI onto the Google Translate service which attempts to guess the language of the input text...

Answer by Greg Hewgill

A statistical approach using digraphs or trigraphs is a very good indicator. For example, here are the most common digraphs in English, in order: http://www.letterfrequency.org/#digraph-frequency (one can find better or more complete lists). This method may have a better success rate than word analysis for short snippets of text, because a text contains more digraphs than complete words.
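
As a rough sketch of the digraph idea: the code below builds a digraph frequency profile for the input and for one sample text per language, then picks the language whose profile is closest by cosine similarity. The class name DigraphLanguageGuesser, the choice of cosine similarity, and the idea of passing short sample texts are all assumptions for illustration; a real detector would use precomputed profiles built from large corpora.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DigraphLanguageGuesser
{
    // Counts adjacent letter pairs inside words, e.g. "the" -> "th", "he".
    public static Dictionary<string, double> Profile(string text)
    {
        var counts = new Dictionary<string, double>();
        string lower = text.ToLowerInvariant();
        for (int i = 0; i + 1 < lower.Length; i++)
        {
            // Skip pairs that span a space, digit, or punctuation mark.
            if (!char.IsLetter(lower[i]) || !char.IsLetter(lower[i + 1])) continue;
            string digraph = lower.Substring(i, 2);
            counts.TryGetValue(digraph, out double n);
            counts[digraph] = n + 1;
        }
        return counts;
    }

    // Cosine similarity of two profiles: 0 = nothing shared, 1 = same distribution.
    public static double Similarity(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Sum(p => b.TryGetValue(p.Key, out double v) ? p.Value * v : 0.0);
        double normA = Math.Sqrt(a.Values.Sum(v => v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => v * v));
        return normA == 0 || normB == 0 ? 0 : dot / (normA * normB);
    }

    // samples maps a language name to a reference text in that language (assumption).
    public static string Guess(string text, Dictionary<string, string> samples)
    {
        var profile = Profile(text);
        return samples
            .OrderByDescending(s => Similarity(profile, Profile(s.Value)))
            .First().Key;
    }
}
```

The same structure works for trigraphs by changing the window from 2 to 3 characters. With sample texts this small the result is only indicative; accuracy depends almost entirely on the size and quality of the reference profiles.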

Answer by Ivan Akcheurov

Fast answer: NTextCat (NuGet, Online Demo)

Long answer:

Currently the best way seems to be to use classifiers trained to classify a piece of text into one (or more) languages from a predefined set.

There is a Perl tool called TextCat. It has language models for the 74 most popular languages, and it has been ported to a large number of programming languages.

There were no ports for .NET, so I wrote one: NTextCat on GitHub.

It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.

Any feedback is much appreciated! New ideas and feature requests are welcome too :)

An alternative is to use one of the numerous online services (e.g. the Google one mentioned above, detectlanguage.com, langid.net, etc.).

Answer by f3lix

The CLD3 (Compact Language Detector v3) library from Google's Chromium browser

You could wrap the CLD3 library, which is written in C++.

Answer by ariful islam

We can use Regex.IsMatch(text, "[\\uxxxx-\\uxxxx]+") to detect a specific language by its script. Here xxxx is the 4-digit hexadecimal Unicode code point of a character.
To detect Arabic:

bool isArabic = Regex.IsMatch(yourtext, @"[\u0600-\u06FF]+");
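
The same range check can be extended to several scripts at once. The sketch below is only an illustration: the class name ScriptDetector and the choice of four Unicode blocks are assumptions, and a script is not the same thing as a language (the Arabic block also covers Persian and Urdu, and the Cyrillic block covers Russian, Bulgarian, and others).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ScriptDetector
{
    // One basic Unicode block per script; extended blocks are left out for brevity.
    private static readonly Dictionary<string, Regex> Scripts = new Dictionary<string, Regex>
    {
        ["Arabic"]   = new Regex(@"[\u0600-\u06FF]"),
        ["Hebrew"]   = new Regex(@"[\u0590-\u05FF]"),
        ["Cyrillic"] = new Regex(@"[\u0400-\u04FF]"),
        ["Greek"]    = new Regex(@"[\u0370-\u03FF]"),
    };

    // Returns the script with the most matching characters, or null if none match.
    public static string Detect(string text)
    {
        var best = Scripts
            .Select(s => new { Script = s.Key, Count = s.Value.Matches(text).Count })
            .OrderByDescending(x => x.Count)
            .First();
        return best.Count > 0 ? best.Script : null;
    }
}
```

Counting matches per script, rather than testing for a single match, lets mixed text (for example an Arabic sentence containing one Greek symbol) resolve to its dominant script.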

Answer by Reg Edit

You may use the C# package for language identification from Microsoft Research:

This package implements several algorithms for language identification, and includes two sets of pre-compiled language profiles. One set covers 52 languages and was trained on Wikipedia (i.e. a well-written corpus); the other covers 26 languages and was constructed from Twitter (i.e. a highly colloquial corpus). The language identifiers are packaged up as a C# library, and can be easily embedded into other C# projects.

Download the package from the above link.

Answer by NGambit

One alternative is to use the 'Translator Text API', which is

... part of the Azure Cognitive Services API collection of machine learning and AI algorithms in the cloud, and is readily consumable in your development projects

Here's a quickstart guide on how to detect language from text using this API.
