How do I extract the "href" attributes from an HTML document with select.rs?


Keywords: http


Question: 

I am trying to write a very basic crawler. After receiving an HTTP response, I am using the select.rs crate to extract the URLs from the body for further crawling.

How can I extract these URLs from the "document", which is the body of the HTTP response, using a for loop?

extern crate hyper;
extern crate select;
extern crate xhtmlchardet;
extern crate robotparser;
extern crate url;

use std::io::Read;
use Crawler::hyper::client::Client;
use Crawler::hyper::header::Connection;
use Crawler::select::document::Document;
use Crawler::select::predicate::*;

pub fn crawl(url: &str) {

    //Opens up a new HTTP client
    let client = Client::new();

    //Creates outgoing request
    let mut res = client.get(&*url)
        .header(Connection::close())
        .send().unwrap();

    //Reads the response
    let mut body = String::new();
    res.read_to_string(&mut body).unwrap();

    println!("Response: {}", res.status);
    println!("Headers:\n{}", res.headers);
    println!("Body:\n{}", body);


    let document = Document::from_str(&*body);

    for node in document.find(Attr("id", "hmenus")).find(Name("a")).iter() {
        println!("{} ({:?})", node.text(), node.attr("href").unwrap());
    }
}

The result of executing crawl for a URL like "um.ac.ir" is a full HTTP response with a body. I am trying to extract the hrefs from this output.

Response: 200 OK
Headers:
X-Content-Type-Options: nosniff
X-Frame-Options: sameorigin
Cache-Control: cache
Date: Tue, 27 Feb 2018 13:16:27 GMT
Vary: Accept-Encoding
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Transfer-Encoding: chunked
Pragma: no-cache
Server: GFW/2.0
Connection: close
Content-Type: text/html; charset=utf-8
Strict-Transport-Security: max-age=63072000; preload
Set-Cookie: POSTNUKESID=pnd2nuadgastqak5h6nop87c63; path=/

...

<div class="col-md-4">
    <h3>سایر</h3>
    <ul> 
        <li><a target="_blank" href="">سایت خبری ftp دانشگاه</a></li>
        <li><a target="_blank" href="">گزینش دانشگاه </a></li>
        <li><a target="_blank" href="">مدیریت حراست دانشگاه </a></li>
        <li><a target="_blank" href="">مركز آثارمفاخر و اسناد دانشگاه</a></li>
        <li><a target="_blank" href="">مدیریت همكاری های علمی و بین المللی</a></li>
        <li><a target="_blank" href=""> مدیریت نظارت و ارزیابی دانشگاه</a></li>
        <li><a target="_blank" href="">سایت سایبان مهر</a></li>
        <li><a target="_blank" href="">بنیاد دانشگاهی فردوسی</a></li>
        <li><a target="_blank" href="">آگهي ها و تبليغات دانشگاه</a></li>
        <li><a target="_blank" href="">سامانه مدیریت وبلاگ</a></li>
        <li><a target="_blank" href="">بسیج اساتید</a></li>
        <li><a target="_blank" href="">بسیج كاركنان</a></li>
        <li><a target="_blank" href="">نهاد نمایندگی رهبری در دانشگاه</a></li>
    </ul> 
</div>

...   

The problem is that println!("{} ({:?})", node.text(), node.attr("href").unwrap()) prints nothing, apparently because the loop over [...].iter() never yields a node:

for node in document.find(Attr("id", "hmenus")).find(Name("a")).iter() {
    println!("{} ({:?})", node.text(), node.attr("href").unwrap());
}

It seems that find(Attr("id", "hmenus")).find(Name("a")) isn't the right way to find the "href" attributes of the links in the body of the HTTP response.

I believe rewriting this part should fix the problem in my code, but that requires a general understanding of how select::document works.


1 Answer: 

I assume you copied the Attr("id", "hmenus") from some example code. This is a filter predicate that matches an HTML node with the attribute id="hmenus". Your example page um.ac.ir does not contain any node with id="hmenus", so the first find returns an empty selection and the loop body never runs. If you want the crawler to find all <a> nodes on the page, the filter predicate is simply Name("a"). Using if let instead of unwrap() also avoids a panic on <a> nodes that have no href attribute.

for node in document.find(Name("a")).iter() {
    if let Some(href) = node.attr("href") {
        println!("{} ({:?})", node.text().trim(), href);
    }
}
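
Note that many of the links in your sample output have an empty href, so a crawler will probably want to drop empty and duplicate values before queueing them. A minimal sketch of that step, using only the standard library: collect_links is a hypothetical helper name (not part of select.rs), and each input item stands in for what node.attr("href") returns.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not part of select.rs): keeps the non-empty,
/// de-duplicated href values, sorted lexicographically.
/// Each input item models the result of `node.attr("href")`, which is
/// `None` when the <a> node has no href attribute.
fn collect_links<'a, I>(hrefs: I) -> Vec<String>
where
    I: IntoIterator<Item = Option<&'a str>>,
{
    let unique: BTreeSet<String> = hrefs
        .into_iter()
        .flatten() // drop the `None` entries (no href attribute)
        .map(str::trim) // strip surrounding whitespace
        .filter(|href| !href.is_empty()) // drop href="" entries
        .map(String::from)
        .collect(); // the set removes duplicates
    unique.into_iter().collect()
}
```

Inside the loop you would push node.attr("href") into a Vec instead of printing it, then pass that Vec to collect_links. Relative links would still need to be resolved against the page URL (for example with the url crate you already import) before they can be fetched.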