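jsoup Java HTML Parser
jsoup is an open source Java HTML parser that we can use to parse HTML and extract useful information from it.
You can also think of jsoup as a web scraping tool for the Java programming language.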
jsoup
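The jsoup API can fetch HTML from a URL, or parse it from an HTML string or an HTML file.
Some of the most useful features of the jsoup API are:
- Fetch HTML from a URL, or read it from a String or a file.
- Extract data from HTML through DOM-based traversal or CSS-like selectors.
- Edit the HTML itself.
The jsoup API is self-contained; we don't need any other jars to use it.
You can download the jsoup jar from the jsoup website or, if you are using Maven, add the following dependency.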
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.1</version>
</dependency>
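Let's explore the different jsoup features one by one.
jsoup Example: Loading an HTML Document from a URL
We can do this with a single line of code, as shown below.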
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect("https://www.theitroad.local").get();
System.out.println(doc.html()); //prints HTML data
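In practice a fetch can hang or fail, so it often helps to set a timeout and a user agent before calling get(). The following is only a minimal sketch: the class name ConnectDemo, the 5000 ms timeout and the https://example.com/ URL are assumptions chosen for illustration, not values from this article.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ConnectDemo {

    // Builds a connection with an explicit user agent and timeout; no request is sent yet.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(5000); // milliseconds (assumed value)
    }

    // Sends the GET request and returns the page title.
    static String fetchTitle(String url) throws IOException {
        Document doc = configure(url).get();
        return doc.title();
    }

    public static void main(String[] args) {
        // Hypothetical default URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        try {
            System.out.println(fetchTitle(url));
        } catch (IOException e) {
            System.err.println("Could not load " + url + ": " + e.getMessage());
        }
    }
}
```

Keeping configure() separate from the actual request makes the settings easy to reuse across several fetches.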
jsoup Example: Parsing an HTML Document from a String
If we have the HTML data as a String, we can parse it with the following code.
String source = "<html><head><title>Jsoup Example</title></head>"
        + "<body><h1>Welcome to theitroad!!</h1><br />"
        + "</body></html>";
Document doc = Jsoup.parse(source);
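When HTML comes from a String, jsoup cannot know where it was served from, so relative links stay relative. As a rough sketch, the two-argument Jsoup.parse(String html, String baseUri) overload combined with absUrl() resolves them. The base URI https://example.com/ and the snippet are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseWithBaseUri {

    // Parses the HTML against the given base URI and returns the absolute URL
    // of the first link, or an empty string if there is no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/about\">About</a></body></html>";
        // prints https://example.com/about
        System.out.println(firstAbsoluteLink(html, "https://example.com/"));
    }
}
```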
jsoup Example: Loading a Document from a File
If the HTML data is saved in a file, we can load it with the following code.
Document doc = Jsoup.parse(new File("data.html"), "UTF-8");
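Jsoup.parse(File, String) throws an IOException when the file cannot be read. The sketch below writes a small temporary file first so that it can run anywhere. The class name, the file contents and the "Saved Page" title are assumptions for illustration only.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseFileDemo {

    // Parses the file as UTF-8 and returns the document title.
    static String titleOf(File file) throws IOException {
        Document doc = Jsoup.parse(file, "UTF-8");
        return doc.title();
    }

    // Writes a sample page to a temporary file, then parses it back.
    static String demo() {
        try {
            File tmp = File.createTempFile("jsoup-demo", ".html");
            tmp.deleteOnExit();
            String html = "<html><head><title>Saved Page</title></head><body></body></html>";
            Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return titleOf(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints Saved Page
    }
}
```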
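Parsing an HTML Body Fragment
One of the best features of jsoup is that if we give it an HTML body fragment, it does its best to produce valid HTML for us, as the following example shows.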
String html = "<div><p>Test Data</p>";
Document doc1 = Jsoup.parseBodyFragment(html);
System.out.println(doc1.html());
The code above prints the following HTML.
<html>
 <head></head>
 <body>
  <div>
   <p>Test Data</p>
  </div>
 </body>
</html>
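Since parseBodyFragment() wraps the fragment in a full document, the parsed fragment itself is available through doc.body(). A minimal sketch (the class and method names are hypothetical) that lists the top-level elements jsoup placed inside the body:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BodyFragmentDemo {

    // Returns the tag names of the top-level elements inside <body>, comma separated.
    static String topLevelTags(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        StringBuilder tags = new StringBuilder();
        for (Element child : doc.body().children()) {
            if (tags.length() > 0) {
                tags.append(',');
            }
            tags.append(child.tagName());
        }
        return tags.toString();
    }

    public static void main(String[] args) {
        System.out.println(topLevelTags("<div><p>Test Data</p>")); // prints div
        System.out.println(topLevelTags("<p>One<p>Two"));          // prints p,p
    }
}
```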
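Now let's look at the different ways to extract data from HTML.
jsoup DOM Methods
Much like the HTML DOM, jsoup parses HTML into a Document.
A document consists of elements, and there are many useful methods for finding them.
Some of the methods in Document are:
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key)
- siblingElements(), firstElementSibling(), lastElementSibling(), etc.
Elements carry attributes and content, so there are also methods for working with element data.
- attr(String key) to get and attr(String key, String value) to set attributes
- id(), className() and classNames()
- text() to get and text(String value) to set the text content
- html() to get and html(String value) to set the inner HTML content
- tag() and tagName()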
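The finder and accessor methods listed above can be tried on an in-memory snippet. This is only a sketch under assumed input: the HTML constant, the "content" id and the class names are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomReadDemo {

    // Made-up sample markup.
    static final String HTML =
            "<div id=\"content\" class=\"main wide\">"
          + "<a href=\"/one\">First</a><a href=\"/two\">Second</a></div>";

    // Summarizes the #content element as tagName|className|href|text of its first link.
    static String describe(String html) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById("content");
        if (content == null) {
            return "no #content element"; // getElementById returns null when nothing matches
        }
        Element firstLink = content.getElementsByTag("a").first();
        String link = firstLink == null
                ? "no link"
                : firstLink.attr("href") + "|" + firstLink.text();
        return content.tagName() + "|" + content.className() + "|" + link;
    }

    public static void main(String[] args) {
        System.out.println(describe(HTML)); // prints div|main wide|/one|First
    }
}
```

Note that getElementById() returns null for a missing id, so a null check is worthwhile before chaining further calls.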
There are also methods for manipulating the HTML.
- append(String html), prepend(String html)
- appendText(String text), prependText(String text)
- appendElement(String tagName), prependElement(String tagName)
- html(String value)
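To see how these manipulation methods differ, here is a small sketch on an invented fragment (the "box" id and all text values are assumptions). append() and prepend() parse their argument as HTML, appendText() inserts plain text that is escaped on output, and appendElement() returns the newly created child.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomEditDemo {

    // Builds a small element tree using the manipulation methods.
    static Element build() {
        Document doc = Jsoup.parseBodyFragment("<div id=\"box\"><p>middle</p></div>");
        Element box = doc.getElementById("box");
        box.prepend("<h2>top</h2>");               // parsed as HTML, inserted as first child
        box.append("<p>after</p>");                // parsed as HTML, inserted as last child
        box.appendElement("span").text("bottom");  // creates <span> and sets its text
        box.appendText(" & done");                 // plain text, "&" is escaped in html()
        return box;
    }

    // Returns the tag names of the direct child elements, comma separated.
    static String childTags(Element parent) {
        StringBuilder tags = new StringBuilder();
        for (Element child : parent.children()) {
            if (tags.length() > 0) {
                tags.append(',');
            }
            tags.append(child.tagName());
        }
        return tags.toString();
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(childTags(box)); // prints h2,p,p,span
        System.out.println(box.html());
    }
}
```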
Below is a simple example where I use the jsoup DOM methods to parse my home page and list all the links.
package com.theitroad.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExtractLinks {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.theitroad.local").get();
        //the main content area of the page has id="content"
        Element content = doc.getElementById("content");
        //collect every <a> element inside it
        Elements links = content.getElementsByTag("a");
        for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println("Text::"+linkText+", URL::"+linkHref);
        }
    }
}
The program above produces the following output.
Text::jQuery Popup and Tooltip Window Animation Effects, URL::https://www.theitroad.local/6998/jquery-popup-and-tooltip-window-animation-effects
Text::Jobin Bennett, URL::https://www.theitroad.local/author/jobin
Text::March 7, 2014, URL::https://www.theitroad.local/6998/jquery-popup-and-tooltip-window-animation-effects
Text::jQuery, URL::https://www.theitroad.local/dev/jquery
Text::jQuery Plugins, URL::https://www.theitroad.local/dev/jquery/jquery-plugins
Text::Permalink, URL::https://www.theitroad.local/6998/jquery-popup-and-tooltip-window-animation-effects
Text::Apache HttpClient Example to send GET/POST HTTP Requests, URL::https://www.theitroad.local/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::hyman, URL::https://www.theitroad.local/author/hyman
Text::March 6, 2014, URL::https://www.theitroad.local/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Java, URL::https://www.theitroad.local/dev/java
Text::Permalink, URL::https://www.theitroad.local/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Java HttpURLConnection Example to send HTTP GET/POST Requests, URL::https://www.theitroad.local/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::hyman, URL::https://www.theitroad.local/author/hyman
Text::March 6, 2014, URL::https://www.theitroad.local/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::Java, URL::https://www.theitroad.local/dev/java
Text::Permalink, URL::https://www.theitroad.local/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::How to integrate Google reCAPTCHA in Java Web Application, URL::https://www.theitroad.local/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::hyman, URL::https://www.theitroad.local/author/hyman
Text::March 4, 2014, URL::https://www.theitroad.local/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::Java EE, URL::https://www.theitroad.local/dev/java/j2ee
Text::Permalink, URL::https://www.theitroad.local/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::JSF Spring Hibernate Integration Example Tutorial, URL::https://www.theitroad.local/7122/jsf-spring-hibernate-integration-example-tutorial
Text::hyman, URL::https://www.theitroad.local/author/hyman
Text::March 3, 2014, URL::https://www.theitroad.local/7122/jsf-spring-hibernate-integration-example-tutorial
Text::Hibernate, URL::https://www.theitroad.local/dev/hibernate
Text::JSF, URL::https://www.theitroad.local/dev/jsf
Text::Spring, URL::https://www.theitroad.local/dev/spring
Text::Permalink, URL::https://www.theitroad.local/7122/jsf-spring-hibernate-integration-example-tutorial
Text::JSF Spring Integration Example Tutorial, URL::https://www.theitroad.local/7112/spring-jsf-integration
Text::Oracle Webcenter Portal Framework Application – Modifying Home Page And Login/Logout Target Pages & Deploying Your Application Into Custom Portal Managed Server Instance, URL::https://www.theitroad.local/6938/oracle-webcenter-portal-framework-application-modifying-home-page-and-loginlogout-target-pages-deploying-your-application-into-custom-portal-managed-server-instance
Text::JSF and JDBC Integration Example Tutorial, URL::https://www.theitroad.local/7068/jsf-database-example-mysql-jdbc
Text::Count the Number of Triangles in Given Picture – Programmatic Solution, URL::https://www.theitroad.local/7064/count-the-number-of-triangles-in-given-picture-programmatic-solution
Text::JSF Expression Language (EL) Example Tutorial, URL::https://www.theitroad.local/7058/jsf-expression-language-jsf-el
Text::Read all Articles →, URL::https://www.theitroad.local/page/2
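jsoup Selector Syntax
We can also use CSS-like or jQuery-like syntax to find and manipulate HTML elements.
Both Document and Element provide select(String cssQuery) for this purpose.
Some simple examples are:
- doc.select("a"): returns all "a" tag elements in the HTML.
- doc.select("c|if"): finds <c:if> elements.
- doc.select("#id1"): returns all elements with id="id1".
- doc.select(".cl1"): returns all elements with class="cl1".
- doc.select("[href]"): returns all elements that have an href attribute.
We can also combine selectors; you can find more details in the Selector API documentation.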
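As a rough illustration of combining selectors, the sketch below counts matches for a few queries against an invented snippet. The markup, ids and class names are assumptions, and the queries use the descendant, attribute-prefix and comma-group forms that jsoup's selector syntax supports.

```java
import org.jsoup.Jsoup;

class SelectorDemo {

    // Made-up markup with three links, two of which have an href.
    static final String HTML =
            "<div id=\"id1\" class=\"cl1\">"
          + "<a href=\"https://example.com/a\">A</a><a>no href</a></div>"
          + "<div class=\"cl2\"><a href=\"/b\">B</a></div>";

    // Returns how many elements match the given CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));               // 3: every <a>
        System.out.println(count("a[href]"));         // 2: only links with an href
        System.out.println(count("div.cl1 a[href]")); // 1: href links inside div.cl1
        System.out.println(count("a[href^=https]"));  // 1: href starting with "https"
        System.out.println(count("#id1, .cl2"));      // 2: either selector matches
    }
}
```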
Now let's look at an example where I use both the DOM and Selector APIs to get the Google+ author URL from my site.
package com.theitroad.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupFindAuthor {

    public static void main(String[] args) throws IOException {
        //theitroad.local posts have author set as below
        //<div class="g-person" data-width="350" data-href="//plus.google.com/u/0/118104420597648001532"
        //data-layout="landscape" data-rel="author"></div>
        findAuthorUsingDOM();
        findAuthorUsingSelector();
    }

    private static void findAuthorUsingSelector() throws IOException {
        Document doc = Jsoup.connect("https://www.theitroad.local").get();
        Elements authors = doc.select("div.g-person"); //selector combination
        for (Element author : authors) {
            System.out.println("Selector:: Author Google+ URL::"+author.attr("data-href"));
        }
    }

    private static void findAuthorUsingDOM() throws IOException {
        Document doc = Jsoup.connect("https://www.theitroad.local").get();
        Elements authors = doc.getElementsByClass("g-person");
        for (Element author : authors) {
            System.out.println("DOM:: Author Google+ URL::"+author.attr("data-href"));
        }
    }
}
The program above prints the following output.
DOM:: Author Google+ URL:://plus.google.com/u/0/118104420597648001532
Selector:: Author Google+ URL:://plus.google.com/u/0/118104420597648001532
jsoup Example: Modifying HTML
Now let's look at a jsoup example where I parse the input HTML and then manipulate it.
package com.theitroad.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupModifyHTML {

    public static final String SOURCE_HTML = "<html><head><title>Jsoup Example</title></head>"
            + "<body><h1>Welcome to theitroad!!</h1><br />"
            + "<div id=\"id1\">Hello</div>"
            + "<div class=\"class1\">hyman</div>"
            + "<a href=\"https://theitroad.local\">Home</a>"
            + "<a href=\"https://wikipedia.org\">Wikipedia</a>"
            + "</body></html>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SOURCE_HTML);
        System.out.println("Title="+doc.title());

        //let's add attribute target="_blank" to all the links
        doc.select("a[href]").attr("target", "_blank");
        //System.out.println(doc.html());

        //change div class="class1" to class2
        doc.select("div.class1").attr("class", "class2");
        //System.out.println(doc.html());

        //change the HTML value of first h1 element
        doc.select("h1").first().html("Welcome to theitroad.local");
        doc.select("h1").first().append("!!");
        //System.out.println(doc.html());

        //let's make Home link bold
        doc.select("a[href]").first().html("<strong>Home</strong>");
        System.out.println(doc.html());
    }
}
Look closely at the program above to understand the changes made to the input HTML string.
Also compare it with the final document, shown below.
Title=Jsoup Example
<html>
 <head>
  <title>Jsoup Example</title>
 </head>
 <body>
  <h1>Welcome to theitroad.local!!</h1>
  <br>
  <div id="id1">
   Hello
  </div>
  <div class="class2">
   hyman
  </div>
  <a href="https://theitroad.local" target="_blank"><strong>Home</strong></a>
  <a href="https://wikipedia.org" target="_blank">Wikipedia</a>
 </body>
</html>
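One thing to keep in mind about attr("class", "class2") is that it overwrites the whole class attribute, so any other class names on the element are lost. Where that matters, removeClass() and addClass() may be a better fit. The following is a sketch on invented markup; the extra "note" class is an assumption added to show the difference.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ClassEditDemo {

    // Replaces class1 with class2 on matching divs while keeping any other class names.
    static Element swapClass(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("div.class1").removeClass("class1").addClass("class2");
        return doc.select("div").first();
    }

    public static void main(String[] args) {
        Element div = swapClass("<div class=\"class1 note\">hyman</div>");
        System.out.println(div.hasClass("class1")); // false
        System.out.println(div.hasClass("class2")); // true
        System.out.println(div.hasClass("note"));   // true, the unrelated class survives
    }
}
```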
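jsoup Example: Parsing a Google Search Page and Finding the Results
Before wrapping up, here is an example where I parse the first page of Google search results and collect all the links.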
package com.theitroad.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParsingGoogleSearch {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.google.com/search?q=java").userAgent("Mozilla/5.0").get();
        //System.out.println(doc.html());

        //each result title is a link directly inside <h3 class="r">
        Elements resultsH3 = doc.select("h3.r > a");
        for (Element result : resultsH3) {
            String linkHref = result.attr("href");
            String linkText = result.text();
            //print the part of href between index 6 and the first "&";
            //for hrefs of the form "/url?q=<target>&..." the prefix is 7 characters long,
            //so an offset of 6 leaves a leading "=" in the printed URL
            System.out.println("Text::" + linkText + ", URL::" + linkHref.substring(6, linkHref.indexOf("&")));
        }
    }
}
It prints the following output.
Text::Download Free Java Software, URL::=https://java.com/download
Text::java.com: Java + You, URL::=https://www.java.com/
Text::Oracle Technology Network for Java Developers | Oracle ..., URL::=https://www.oracle.com/technetwork/java/
Text::Java (software platform) - Wikipedia, the free encyclopedia, URL::=https://en.wikipedia.org/wiki/Java_(software_platform)
Text::Java (programming language) - Wikipedia, the free encyclopedia, URL::=https://en.wikipedia.org/wiki/Java_(programming_language)
Text::Java Tutorial - TutorialsPoint, URL::=https://www.tutorialspoint.com/java/
Text::Welcome to JavaWorld.com, URL::=https://www.javaworld.com/
Text::Java.net: Welcome, URL::=https://www.java.net/
Text::News for java, URL::h?q=java
Text::Javalobby | The heart of the Java developer community, URL::=https://java.dzone.com/
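Note that Google search results are currently h3 elements with class "r", and the "a" element inside each one holds the link.
So if anything changes in the future, for example the class name of the h3 tag, this program will stop working and we will have to adjust the selector by inspecting the source HTML structure.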
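The fixed substring offsets are also fragile: the output above shows a leading "=" on most URLs and "h?q=java" for the news entry. A somewhat more robust approach, sketched here with plain JDK classes, is to read the "q" query parameter from the href. The class name RedirectHrefParser and the sample hrefs are hypothetical, and the "/url?q=..." shape is inferred from the printed output rather than confirmed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RedirectHrefParser {

    // Returns the decoded value of the "q" query parameter,
    // or the href unchanged when there is no such parameter.
    static String targetUrl(String href) {
        int queryStart = href.indexOf('?');
        if (queryStart < 0) {
            return href;
        }
        for (String pair : href.substring(queryStart + 1).split("&")) {
            if (pair.startsWith("q=")) {
                try {
                    return URLDecoder.decode(pair.substring(2), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    // UTF-8 is always available on the JVM, so this should not happen.
                    throw new IllegalStateException(e);
                }
            }
        }
        return href;
    }

    public static void main(String[] args) {
        // Sample hrefs invented for illustration.
        System.out.println(targetUrl("/url?q=https://java.com/download&sa=U")); // https://java.com/download
        System.out.println(targetUrl("/search?q=java&tbm=nws"));                // java
        System.out.println(targetUrl("/about"));                                // /about
    }
}
```

In the search example, linkHref could be passed through targetUrl() in place of the substring call; hrefs without an "&" would then no longer cause an exception.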