Scraping Data
Scraping data from the United Nations and WHO Photo Libraries was the single most difficult task in the entire process. Each website stores its archive in an outdated format, and the accompanying photograph data is often not standardized. It took multiple scraping attempts before I could produce readable CSV files and full thumbnail downloads of the images.
UN Photo Library
The UN Photo Library’s search results are displayed in a tile format of image thumbnails and associated titles. Clicking a tile leads to a page with more detailed information on the selected photograph; this individual content is generated dynamically by JavaScript.
I sought to scrape the following information from each photograph:
- Title of Photo
- Description (Caption)
- Date
- Location
- Photo Number
- Credits
- Link
The UN Photo Library website’s search function is limited and offers little room to organize results on the site before scraping: results can be sorted only by relevance or by ascending or descending date, and there is no way to view more than 18 photographs per page. A web-scraping job therefore needed to account for pagination and know to continue to the following pages.
This was where my first attempt, using Grepsr, a simple Chrome extension, failed. While Grepsr could recognize the text fields (e.g. title, description, date), it could not recognize the “next” button on either the main search page or the individual photo pages. The scraper therefore never reached all of the search results, ending either after the first 18 or after extracting the same page multiple times.
Web Scraper, another Chrome extension, proved to be a more powerful tool for the project. Most importantly, it allowed me to build a sitemap. By manipulating the start URL, I was able to direct the scrape to account for pagination when navigating the site, defining the total number of results and the number of photos per page. The following start URL was used for the search term “disability”:
http://www.unmultimedia.org/photo/gallery.jsp?query=disability&startat=[0-663:18]
There were 663 results, displayed 18 images per page, and the scrape was defined to start at 0. Following this first step, I used different selectors to extract multiple types of data: text and links in the form of a CSV file, and images in a single folder.
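In effect, the bracketed range [0-663:18] tells Web Scraper to generate one start URL per page of results. A short Python sketch of that same expansion, for clarity:

```python
# Web Scraper expands the range [0-663:18] into one URL per page of results.
# This reproduces that expansion in plain Python.
base = "http://www.unmultimedia.org/photo/gallery.jsp?query=disability&startat={}"
page_urls = [base.format(start) for start in range(0, 664, 18)]
# startat = 0, 18, 36, ..., 648 -- every page of the 663 results, 18 per page
```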
I repeated this process three more times to scrape data from the other search terms. As of writing, there were 663 results for “disability,” 62 for “handicap,” 17 for “cripple,” and 8 for “retard.” As I was dealing with a few hundred search results and photo downloads, I ran the scraper on a virtual machine, which allowed me to leave the scrape running and come back to it later.
The full sitemaps I generated are provided below:
Search "disability":
Search "handicap":
Search "cripple":
Search "retard":
WHO Photo Library
Scraping data from the WHO Photo Archives website proved excruciatingly complex and ugly. I focused primarily on the photographs returned by the keyword search “disabled.” The search results are presented in a similar tile format, and the site does offer more options for sorting and for changing the number of results per page. Yet the thumbnails’ tendency to expand when the cursor moves over them makes navigating the website difficult.
But perhaps more alarmingly, the WHO Photo Library lacks alt text, which would provide greater accessibility for those who use a screen reader because of a visual impairment. This is extremely problematic given the WHO’s role as an international organization concerned with public health.
The biggest challenge for scraping data is hidden in the URL for the search result:
https://extranet.who.int/photolibrary/desktop.htm?session_id=ifqCtjs.pFTwz&language_id=eng
As the session_id= parameter in the URL indicates, every search on the archive runs inside a temporary browsing session. This means that neither the URL of a specific photograph in the archive nor the URL of the search results can be shared. The session is unique to the browser that opened it: the state of the search lives in that session, identified by the session ID and its accompanying cookies, so the same URL will not reproduce the same page on another computer. Everything that the UN site expressed through the URL and the server, with JavaScript displaying the content, the WHO site instead ties to session state held between the server and an individual browser. The techniques used for the UN Photo Library therefore do not work here: there is no start URL to manipulate, and while fields can be selected, there is no way to account for pagination.
Scraping the WHO Photo Library ultimately required a programming language, and I used Python to successfully obtain my data. I’ve included a notebook detailing the process here.
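The notebook documents the real endpoints and selectors; the sketch below only illustrates the general approach, with the search endpoint, its parameters, and the CSS selectors all assumptions. A requests.Session object holds the session cookies across requests, so pagination can be driven from Python even though it cannot be expressed in a shareable URL.

```python
import csv

import requests
from bs4 import BeautifulSoup

# A requests.Session persists the cookies the WHO site sets, so every
# request below belongs to one server-side session. The search endpoint,
# its parameters, and the CSS selectors are assumptions for illustration;
# the notebook linked above documents the real ones.
session = requests.Session()
session.get("https://extranet.who.int/photolibrary/desktop.htm",
            params={"language_id": "eng"}).raise_for_status()

rows = []
page = 1
while True:
    resp = session.get(
        "https://extranet.who.int/photolibrary/search.htm",  # hypothetical
        params={"keywords": "disabled", "page": page},
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    tiles = soup.select(".thumbnail")  # assumed selector for result tiles
    if not tiles or page > 100:  # stop on an empty page (with a safety cap)
        break
    for tile in tiles:
        title = tile.select_one(".title")
        caption = tile.select_one(".caption")
        rows.append({
            "title": title.get_text(strip=True) if title else "",
            "caption": caption.get_text(strip=True) if caption else "",
        })
    page += 1

with open("who_disabled.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```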
Once I was able to get a CSV, I noticed that the WHO’s photograph data was significantly less organized and less standardized than that of the UN. There were many more categories and data points on each page, and the following list was extracted through the scrape (a sketch of how some of these overlapping fields can be reconciled follows the list):
- Approximate date
- Caption
- City
- Consent form available
- Country
- Country related
- Credit
- Date
- Date of exposure
- Headline
- Keywords
- Location
- People
- Related information
- Title
- URL links
- WHO Regions
- id
- html
- img
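As one example of the cleanup this called for, several of these columns describe the same thing (three date fields; city and country). A small pandas sketch, assuming the CSV is named who_disabled.csv and that empty cells load as NaN, coalesces them into single columns:

```python
import pandas as pd

# Load the scraped WHO data; the file name is an assumption.
df = pd.read_csv("who_disabled.csv")

# bfill(axis=1) fills each row's gaps from the columns to its right, so
# taking the first column afterwards yields the first non-empty value,
# left to right: the most specific field wins when it exists.
df["date_clean"] = (
    df[["Date of exposure", "Date", "Approximate date"]].bfill(axis=1).iloc[:, 0]
)
df["place_clean"] = df[["City", "Country"]].bfill(axis=1).iloc[:, 0]
```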