Tuesday, 30 May 2017

Create a better images downloader(Google, Bing and Yahoo) by Qt5

  I mentioned how to create a simple Bing image downloader in Download Bing images by Qt5, in this post I will explain how do I tackle the challenges I have encountered when I try to build a better image downloader app by Qt5, the skills I used are apply on QImageScraper version_1.0, if you want to know the details, please dive into the codes, they are too complicated to write down in this blog.

1 : Show all of the images searched by Bing

  To show all of the images searched by Bing, we need to make sure the page is scrolled to the bottom, unfortunately there is no way to check it with 100% accuracy if we are not scrolling the page manually, because the height of the scroll bar keep changing when you scrolling it, this make the program almost impossible to determine when should it stop to scroll the page.

Solution

  I give several solutions a try but none of them are optimal, I have no choice but seek a compromise. Rather than scrolling the page full auto, I adopt semi auto solution as Pic.1 shown.

Pic.1

2 : Not all of the images are downloable

  There are several reasons may cause this issue.


  1. The search engine(Bing, Yahoo, Google etc) fail to find direct link of the image.
  2. The server "think" you is not a real human(robot?)
  3. Network error
  4. There are no error happen, but the reply of the server take too long
  Solution 

  Although I cannot find a perfect solution for problem 2, but there are some tricks to alleviate it, let the flow chart(Pic.2) clear the miasma.

Pic.2

Pic.3


  Simply put, if error happen, I will try to download thumbnail, if even the thumbnail cannot download, I will try next image. After all, this solution is not too bad, let us see the results of download 893 smoke images search by Google.

Pic.4


  All of the images could be downloaded, 817 of them are big images, 76 of them are small images, not a perfect results but not bad either. Something I did not mentioned in Pic.2 and Pic.3 are 

  1. I always switch user agents
  2. I start next download with random period(0.5second~1.5second)
  Purpose of these "weird" operations is try to emulate behaviors of humans, this could lower down the risk of being treated as "robot" by the servers. I cannot find free, trustable proxies yet, else I would like to randomly connect to different proxies time to time too, please tell me where to find those proxies if you know, thanks.


3 : Type of the images are mismatch or did not specify in file extension

  Not all of the images have correct type(jpg, png, gif etc), I am very lucky that Qt5 provide us QImageReader, this class can determine the type of the image from contents rather than extension. With it we can change the suffix of the file into the real format, remove the files which are not images.

4 : QFile fail to rename/remove file

  QFile::rename and QFile::remove got some troubles on windows(works well on mac), this bug me a while, it cost me one day to find out QImageReader blocking the file

5 : Invalid file name 

  Not all of the file name are valid, it is extremely hard to find out a perfect way to determine the file name is valid or not, I only do some minimal process for this issue--Remove illegal characters and trimmed white spaces.

6 : Deploy app on major platforms

  One of the strong selling points Qt is the ability of cross-platform, to tell you the truth I can build the app and run it on windows, mac and linux without changing single line of codes, it work out of the box. Problem is, deploy the app on linux is not fun at all, it is a very complicated task, I will try deploy this image downloader after linuxqtdeploy become mature.

Summary

  In this blog post, I reviewed some problems I met when using Qt5 to develop an image downloader, this is by no means exhaustive but scrape the surface, if you want to know the details, every nitty-gritty, better dive into source codes.

Download Bing images by Qt5
Source codes of QImageScraper

Sunday, 14 May 2017

Download Bing images by Qt5





  Have you ever need more data for your image classifier?I do, but download images search by Google, Bing nor Flickr one by one are very time consuming, why not we write a small, simple images scraper to help us? Sounds like a good idea, as usual, before I start the task, I list out the requirements of this small app.

a : Cross platform, able to work under ubuntu and windows with one code base(no plan for mobiles since this is a tool design for machine learning)
b : Support regular expression, because I need to parse the html
c : Support high level api of networking
d : Have decent webEngine, it is very hard(impossible?) to scrape the images from those search engine without it
e : Support unicode
f : Easy to create ui, because I want instant feedback of the website, this could speed up development times
g : Ease to build, solving dependency problem of different 3rd libraries are not fun at all

  After search through my toolbox I find out Qt5 is almost ideal for my task. In this post I will use Bing as an example(Google and Yahoo images share the same tricks, processes of scraping these big 3 image search engine are very similar). If you ever try to study the source codes of the search results of Bing, you will find out they are very complicated, difficult to read(Maybe MS spend lots of time to prevent users scrape images). Are you afraid?Rest assured, the steps of scraping image from Bing is a little bit complicated but not impossible as long as you have nice tools to aid you :).

Step 1 : You need a decent, modern browser like firefox or chrome

  Why do we need a decent browser?Because they have a powerful feature--Inspect Element, this function can help you find out the contents(links, buttons etc) of the website. 

Pic1


Step 2 : Click Inspect Element on interesting content

  Move your mouse to the contents you want to observe and click Inspect Element.


Pic2


After that the browser should show you the codes of the interesting content.

Pic3

The codes point by the browser may not something you want, if this is the case, look around the codes point by the browser as Pic3 show.

Step 3 : Create a simple prototype by Qt5

  We already know how to inspect the source codes of the web page, let us create a simple ui to help us. This ui do not need to be professional or beautiful, after it is just a prototype. The functions we need to create a Bing image scraper are

a : Scroll pages
b : Click see more images
c : Parse and get the links of images
d : Download images

  With the help of Qt Designer, I am able to "draw" the ui(Pic4) within 5 minutes(ignore parse_icon_link and next_page buttons for this tutorial).


Pic4

Since Qt Designer do not QWebEngineView yet, I add it manually by codes

    ui->gridLayout->addWidget(web_view_, 4, 0, 1, 2);

  Pic5 is what it looks like when running.

Pic5


Step 4 : Implement scroll function by js

    //get scroll position of the scroll bar and make it deeper
    auto const ypos = web_page_->scrollPosition().ry() + 10000;
    //scroll to deeper y position
    web_page_->runJavaScript(QString("window.scrollTo(0, %1)").arg(ypos));

Step 5 : Implement parse image link function 

 
  Before we can get the full link of the images, we need to scrape the links of the page.

    web_page_->gttoHtml([this](QString const &contents)
    {             
        QRegularExpression reg("(search\\?view=detailV2[^\"]*)");
        auto iter = reg.globalMatch(contents);
        img_page_links_.clear();
        while(iter.hasNext()){
            QRegularExpressionMatch match = iter.next();            
            if(match.captured(1).right(20) != "ipm=vs#enterinsights"){
                QString url = QUrl("https://www.bing.com/images/" + 
                                   match.captured(1)).toString();
                url.replace("&amp", "&");
                img_page_links_.push_back(url);
            }
        }
    });

Step 6 : Simulate "See more image"

  This part is a little bit tricky, I tried to find the words "See more images" but find nothing, the reason is the source codes return by View Page Source(Pic 6) do not update.

Pic 6

  Solution is easy, use Inspect Element to replace View Page Source(sometimes it is easier to find the contents you want by View Page Source, both of them are valuable for web scraping).



    
    web_page_->runJavaScript("document.getElementsByClassName"
                             "(\"btn_seemore\")[0].click()");


Step 7 : Download images

  Overall our ultimate goal is download the image we want, let us finish last part of this prototype.

  First, we need to get the html text of the image page(the page with the link of image source).


    if(!img_page_links_.isEmpty()){
        web_page_->load(img_page_links_[0]);
    }

  Second, download the image
    
    void experiment_bing::web_page_load_finished(bool ok)
    {
        if(!ok){
          qDebug()<<"cannot load webpage";
          return;
        }

        web_page_->toHtml([this](QString const &contents)
        {
            QRegularExpression reg("src2=\"([^\"]*)");
            auto match = reg.match(contents);
            if(match.hasMatch()){           
               QNetworkRequest request(match.captured(1));
               QString const header = "msnbot-media/1.1 (+http://search."
                                      "msn.com/msnbot.htm)";
               //without this header, some image cannot download
               request.setHeader(QNetworkRequest::UserAgentHeader, header);                      
               downloader_->append(request, ui->lineEditSaveAt->text());
            }else{
               qDebug()<<"cannot capture img link";
            }
            if(!img_page_links_.isEmpty()){
               //this image should not download again
               img_page_links_.pop_front();
            }
        });
    }

  Third, download next image


    
void experiment_bing::download_finished(size_t unique_id, QByteArray)
{
    if(!img_page_links_.isEmpty()){
        web_page_->load(img_page_links_[0]);
    }
}

 Summary

  These are the key points of scraping images of Bing search by QtWebEngine, The downloader I use in this post are come from qt_enhance, whole prototype are placed at mega. If you want to know more, visit following link

Create a better images downloader(Google, Bing and Yahoo) by Qt5

Warning

  Because Qt-bug 66099, this prototype do not work under windows 10, unfortunately this bug is rated as P2, that means we may need to wait a while before it can be fixed by Qt community.

Edit

  Qt5.9 Beta4 fixed Qt-bug 60669, it works on my laptop(win 10 64bits) and desktop(Ubuntu 16.04.1).

  There exist a better solution to scrape image link of Bing, I will mentioned it on the next post.