Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1263266/

Date: 2020-08-06 14:22:58  Source: igfitidea

c# find image in html and download them

c#

Asked by madman

I want to download all images stored in html (a web page). I don't know how many images will be downloaded, and I don't want to use "HTML AGILITY PACK".

I searched on Google, but every site left me more confused.

I tried regex, but got only one result.

Answered by Szere Dyeri

In general terms

  1. You need to fetch the html page
  2. Search for img tags and extract the src="..." portion out of them
  3. Keep a list of all these extracted image urls.
  4. Download them one by one.
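
Steps 2 and 3 can be sketched with a regular expression, since the question rules out HTML Agility Pack. This is a minimal sketch, not a robust HTML parser: the class name `ImageSourceExtractor` and the method `Extract` are made up for illustration, and the pattern will miss unusual markup (for example a `>` inside an attribute value). Getting "only one result" from a regex is typically what happens when `Regex.Match` is called instead of `Regex.Matches`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class ImageSourceExtractor
{
    // Finds the src attribute of an <img> tag, whether the value is
    // double-quoted, single-quoted, or unquoted.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns every non-empty src value in document order, without duplicates.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        // Matches (plural) scans the whole input; Match returns only the first hit.
        foreach (Match m in ImgSrc.Matches(html))
        {
            string url = m.Groups["url"].Value.Trim();
            if (url.Length > 0 && !urls.Contains(url))
                urls.Add(url);
        }
        return urls;
    }
}
```

Calling `ImageSourceExtractor.Extract(html)` on the fetched page gives the list from step 3; the values may still be relative URLs that need resolving against the page address before step 4.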

Maybe this question about C# HTML parser will help you a little bit more.

Answered by Steve Gilham

You can use a WebBrowser control and extract the HTML from that, e.g.

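System.Windows.Forms.WebBrowser objWebBrowser = new System.Windows.Forms.WebBrowser();
// Navigate is asynchronous: read the document only once it has finished loading.
objWebBrowser.DocumentCompleted += (sender, e) =>
{
    System.Windows.Forms.HtmlDocument objDoc = objWebBrowser.Document;
    // Select by tag name; GetElementsByName would match the name attribute instead.
    System.Windows.Forms.HtmlElementCollection aColl = objDoc.GetElementsByTagName("img");
    ...
};
objWebBrowser.Navigate(new Uri("your url of html document"));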

or directly invoke the IHTMLDocument family of COM interfaces.

Answered by Joel Coehoorn

First of all I just can't leave this phrase alone:

images stored in html

That phrase is probably a big part of the reason your question was down-voted twice. Images are not stored in html. HTML pages have references to images that web browsers download separately.

This means you need to do this in three steps: first download the html, then find the image references inside the html, and finally use those references to download the images themselves.

To accomplish this, look at the System.Net.WebClient class. It has a .DownloadString() method you can use to get the html. Then you need to find all the <img /> tags. You're on your own here, but it's straightforward enough. Finally, you use WebClient's .DownloadData() or .DownloadFile() methods to retrieve the images.
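
The three steps might look roughly like the sketch below. Everything named here (`PageImageDownloader`, `ResolveAll`, `DownloadImages`, the counter-prefixed file names) is an assumption made for illustration, and the inline pattern only handles quoted `src` attributes. The part that is easy to overlook is resolving relative references such as `../img/c.png` against the page address before downloading.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class PageImageDownloader
{
    // Turns src values into absolute http/https URLs, dropping data: URIs,
    // malformed values, and duplicates.
    public static List<Uri> ResolveAll(Uri pageUri, IEnumerable<string> sources)
    {
        var result = new List<Uri>();
        foreach (string src in sources)
        {
            Uri absolute;
            if (Uri.TryCreate(pageUri, src, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                && !result.Contains(absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }

    // Step 1: download the html. Step 2: find the references. Step 3: download each image.
    public static void DownloadImages(string pageUrl, string targetFolder)
    {
        var pageUri = new Uri(pageUrl);
        Directory.CreateDirectory(targetFolder);
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUri);

            // Simplified pattern: quoted src attributes only.
            var sources = new List<string>();
            foreach (Match m in Regex.Matches(
                html, @"<img\b[^>]*?\ssrc\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase))
            {
                sources.Add(m.Groups[1].Value);
            }

            int index = 0;
            foreach (Uri imageUri in ResolveAll(pageUri, sources))
            {
                string name = Path.GetFileName(imageUri.LocalPath);
                if (name.Length == 0) name = "image";
                // The counter keeps equal file names from different folders apart.
                string path = Path.Combine(targetFolder, (index++) + "_" + name);
                client.DownloadFile(imageUri, path);
            }
        }
    }
}
```

A call such as `PageImageDownloader.DownloadImages("http://example.com/a/b.html", "images")` would save whatever the page references into an `images` folder; the URL here is a placeholder.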

Answered by Jon Galloway

People are giving you the right answer - you can't be picky and lazy, too. ;-)

If you use a half-baked solution, you'll deal with a lot of edge cases. Here's a working sample that gets all links in an HTML document using HTML Agility Pack (it's included in the HTML Agility Pack download).

And here's a blog post that shows how to grab all images in an HTML document with HTML Agility Pack and LINQ:

    // Bing Image Result for Cat, First Page
    string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";

    // For speed of dev, I use a WebClient
    WebClient client = new WebClient();
    string html = client.DownloadString(url);

    // Load the Html into the agility pack
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Now, using LINQ to get all Images
    // SelectNodes returns null when nothing matches, so fall back to an empty sequence
    IEnumerable<HtmlNode> candidates =
        doc.DocumentNode.SelectNodes("//img") ?? Enumerable.Empty<HtmlNode>();
    List<HtmlNode> imageNodes =
        (from node in candidates
         where node.Attributes["class"] != null
         && node.Attributes["class"].Value.StartsWith("img_")
         select node).ToList();

    foreach(HtmlNode node in imageNodes)
    {
        Console.WriteLine(node.Attributes["src"].Value);
    }
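
The sample above keeps only images whose class starts with "img_", which fits that particular Bing results page. For an arbitrary page, a more general variant might select every `<img>` that has a `src` attribute. This is a sketch under that assumption: `AgilityImageFinder` and `GetImageSources` are illustrative names, and it needs the HtmlAgilityPack package referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; requires the HtmlAgilityPack package.
public static class AgilityImageFinder
{
    // Returns the src value of every <img> element that has a non-empty one.
    public static List<string> GetImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
            return sources;

        foreach (HtmlNode node in nodes)
        {
            // GetAttributeValue falls back to the given default if the attribute is missing.
            string src = node.GetAttributeValue("src", string.Empty);
            if (src.Length > 0)
                sources.Add(src);
        }
        return sources;
    }
}
```

Each returned value can then be resolved against the page address with `new Uri(baseUri, src)` and passed to `WebClient.DownloadFile`.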