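HTMLAgilityPack is an open-source .NET library for parsing and manipulating HTML documents. It can parse HTML, query DOM elements, and modify HTML content. Because it supports XPath and LINQ queries, you can work with an HTML document much as you would with an XML document, which makes it straightforward to extract the data you need from complex HTML structures.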
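The examples in this article use some of its most common methods and properties: LoadHtml, DocumentNode, SelectSingleNode, SelectNodes, GetAttributeValue, InnerText, and Descendants.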
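As a first taste of the two query styles, here is a minimal sketch that needs no network access. The HapQuickStart class, its method names, and the sample markup are hypothetical and exist only for this illustration; it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HapQuickStart
{
    // Hypothetical sample markup used only for this illustration.
    public const string SampleHtml =
        "<html><head><title>Demo</title></head>" +
        "<body><a href=\"/a\">A</a><a href=\"/b\">B</a><a>no href</a></body></html>";

    // XPath style: text of the first <title> element, or an empty string if there is none.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
        return titleNode?.InnerText ?? string.Empty;
    }

    // LINQ style: href value of every <a> element that actually has an href attribute.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
            .Descendants("a")
            .Where(a => a.Attributes["href"] != null)
            .Select(a => a.GetAttributeValue("href", ""))
            .ToList();
    }
}
```

Calling HapQuickStart.GetTitle(HapQuickStart.SampleHtml) and HapQuickStart.GetLinks(HapQuickStart.SampleHtml) exercises both styles on the same document. In a Windows Forms project the type is usually written as HtmlAgilityPack.HtmlDocument, because System.Windows.Forms also defines an HtmlDocument class.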
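We use HttpClient to send a GET request to the specified HTTPS URL and read the response content.
An HTTP 403 (Forbidden) status code means the server refused the request, usually because it believes you are not allowed to access the resource. If the server requires authentication, you can add an authentication header through HttpClient's DefaultRequestHeaders.Authorization property. The example below sends a browser-like User-Agent header, which some servers check before responding, and then reads the page title.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Windows.Forms;
using HtmlAgilityPack;

// Member of the Form class (class declaration omitted).
private async void btnGetTitle_Click(object sender, EventArgs e)
{
    // Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    string htmlContent = "";
    using (HttpClient httpClient = new HttpClient())
    {
        try
        {
            httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3");
            HttpResponseMessage response = await httpClient.GetAsync("https://www.baidu.com");
            // Throws HttpRequestException if the status code does not indicate success.
            response.EnsureSuccessStatusCode();
            // Read the body as bytes and decode it as UTF-8.
            byte[] bytes = await response.Content.ReadAsByteArrayAsync();
            htmlContent = Encoding.UTF8.GetString(bytes);
        }
        catch (HttpRequestException ex)
        {
            // Report the failure instead of silently swallowing it.
            MessageBox.Show($"Request failed: {ex.Message}");
            return;
        }
    }
    doc.LoadHtml(htmlContent);
    // SelectSingleNode returns null when nothing matches the XPath expression.
    HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
    if (titleNode != null)
    {
        string title = titleNode.InnerText;
        MessageBox.Show($"Page title: {title}");
    }
}
```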
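For servers that keep answering 403 until the request is authenticated, the Authorization header mentioned above could be set roughly as follows. This is only a sketch: the Bearer scheme, the token value, and the AuthorizedFetch class with its two methods are assumptions made for illustration, so substitute whatever scheme the target site actually requires.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class AuthorizedFetch
{
    // Builds a client that sends "Authorization: Bearer <token>" with every request.
    // Both the scheme and the token are assumptions for this sketch.
    public static HttpClient CreateClient(string token)
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", token);
        return client;
    }

    // Downloads the page body as a string using the supplied client.
    public static async Task<string> GetHtmlAsync(HttpClient client, string url)
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            // Throws HttpRequestException on 403 and any other non-success status code.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

The string returned by GetHtmlAsync can be passed to LoadHtml exactly like the htmlContent variable in the handler above.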
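The download logic can be moved into a reusable GetHtml helper, which the next handler uses to list every link on the page.

```csharp
// Additionally requires: using System.Diagnostics; using System.Threading.Tasks;

/// <summary>
/// Downloads the HTML content of the given URL.
/// </summary>
/// <param name="url">Address of the page to download.</param>
/// <returns>The page content decoded as UTF-8, or an empty string if the request fails.</returns>
private async Task<string> GetHtml(string url)
{
    string htmlContent = "";
    using (HttpClient httpClient = new HttpClient())
    {
        try
        {
            httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3");
            HttpResponseMessage response = await httpClient.GetAsync(url);
            // Throws HttpRequestException if the status code does not indicate success.
            response.EnsureSuccessStatusCode();
            // Read the body as bytes and decode it as UTF-8.
            byte[] bytes = await response.Content.ReadAsByteArrayAsync();
            htmlContent = Encoding.UTF8.GetString(bytes);
        }
        catch (HttpRequestException ex)
        {
            // Log the failure; the caller receives an empty string.
            Debug.WriteLine($"Request for {url} failed: {ex.Message}");
        }
    }
    return htmlContent;
}

private async void btnGetLinks_Click(object sender, EventArgs e)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    string htmlContent = await GetHtml("https://www.baidu.com");
    doc.LoadHtml(htmlContent);
    // Select every <a> element that has an href attribute.
    HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
    // SelectNodes returns null, not an empty collection, when nothing matches.
    if (linkNodes != null)
    {
        foreach (HtmlNode linkNode in linkNodes)
        {
            string link = linkNode.GetAttributeValue("href", "");
            lstLink.Items.Add(link);
        }
    }
}
```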
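To select only specific links, narrow the XPath expression. This handler reads the links inside the li elements with class bold-item under the element whose id is pane-news on news.baidu.com.

```csharp
private async void btnGetSpecialLink_Click(object sender, EventArgs e)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    string htmlContent = await GetHtml("https://news.baidu.com/");
    doc.LoadHtml(htmlContent);
    // Single quotes inside the XPath expression avoid escaping double quotes in C#.
    HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//*[@id='pane-news']/ul/li[@class='bold-item']/a");
    if (linkNodes != null)
    {
        foreach (HtmlNode linkNode in linkNodes)
        {
            string link = linkNode.GetAttributeValue("href", "");
            string title = linkNode.InnerText;
            lnkSpecialLink.Items.Add(title + " " + link);
        }
    }
}
```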
Quickly finding a node's path
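The original does not say how the path was obtained. One option inside the library itself, offered here only as a sketch, is the XPath property of HtmlNode, which reports the absolute path of a node you have already located with a looser query. The NodePathDemo class, its method names, and the sample markup are hypothetical.

```csharp
using HtmlAgilityPack;

public static class NodePathDemo
{
    // Hypothetical sample markup used only for this illustration.
    public const string SampleHtml =
        "<html><body><div id=\"pane-news\"><ul>" +
        "<li class=\"bold-item\"><a href=\"/n1\">News 1</a></li>" +
        "</ul></div></body></html>";

    // Locates the first <a href> with a loose query and returns the path HAP reports for it.
    // Returns null when the document contains no such element.
    public static string GetPathOfFirstLink(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a[@href]");
        return link?.XPath;
    }

    // Checks that the reported path selects the very same node again.
    public static bool PathRoundTrips(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a[@href]");
        if (link == null)
        {
            return false;
        }
        HtmlNode again = doc.DocumentNode.SelectSingleNode(link.XPath);
        return ReferenceEquals(again, link);
    }
}
```

A path obtained this way is positional, so it tends to break when the page layout changes; expressions anchored on an id or class, like the one used for pane-news, are usually more robust.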
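LINQ can be used instead of XPath. This handler selects the li elements whose class attribute contains bold-item and lists their text.

```csharp
// Additionally requires: using System.Collections.Generic; using System.Linq;
private async void btnLinqSearch_Click(object sender, EventArgs e)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    string htmlContent = await GetHtml("https://news.baidu.com/");
    doc.LoadHtml(htmlContent);
    // Descendants never returns null, so no null check is needed before the loop.
    IEnumerable<HtmlNode> linkNodes = doc.DocumentNode.Descendants("li")
        .Where(li => li.Attributes["class"]?.Value.Contains("bold-item") == true);
    foreach (HtmlNode linkNode in linkNodes)
    {
        string title = linkNode.InnerText;
        lnkSpecialLink.Items.Add(title);
    }
}
```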
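The LINQ handler above collects only the text of each li element. A possible extension, sketched here under assumptions, also returns the link target the way the XPath version does. The BoldItemQuery class, the GetBoldItems method, and the sample markup are hypothetical, and the class test matches the exact token bold-item rather than any substring.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BoldItemQuery
{
    // Hypothetical sample markup used only for this illustration.
    public const string SampleHtml =
        "<ul>" +
        "<li class=\"bold-item\"><a href=\"/n1\">News 1</a></li>" +
        "<li><a href=\"/n2\">News 2</a></li>" +
        "</ul>";

    // Returns (title, link) pairs for <a> elements that are direct children
    // of <li> elements carrying the class token "bold-item".
    public static List<(string Title, string Link)> GetBoldItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
            .Descendants("li")
            .Where(li => li.GetAttributeValue("class", "")
                           .Split(' ')
                           .Contains("bold-item"))
            .SelectMany(li => li.Elements("a"))
            .Select(a => (Title: a.InnerText.Trim(),
                          Link: a.GetAttributeValue("href", "")))
            .ToList();
    }
}
```

Each returned pair can then be added to the list box as item.Title + " " + item.Link, matching the output format of the XPath handler.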
Author: 技术老小子
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!