Organizing, Cleaning, and Manipulating Data

I began the cleaning phase of the project thinking about potential visualizations for the data. I wanted to showcase the photographs in as many layers and perspectives as possible, so each column of the data needed to be as specific as possible. For example, I sought to organize the dataset not only by location, but by city and country. Furthermore, I had four UN Photo Library search term scrapes and the WHO disability scrape, and ultimately needed to consolidate them into a single CSV file, taking into account overlapping search results.

OpenRefine

OpenRefine provided a quick way to clean messy data, and was used mainly to separate columns that displayed multiple pieces of data (e.g. location).

In order for a mapping system like Carto, and any future visualization, to work, I needed distinct columns listing locations that could be geotagged. The raw CSV file had a single column for location, which usually listed both a city and a country, separated by a comma. Because the locations on the UN Photo Library pages were consistently listed in the “City, Country” format, and a number of photos also listed “United Nations” as a location marker, I set out to reorganize each location into a three-part comma-separated entry: city, country, and location 3. Location 3 would provide a space to keep a note of locations like “United Nations” that are neither a city nor a country. Since OpenRefine would separate the location entries into three columns based on commas, I first used a text editor to ensure each entry would land in the right column. “United Nations, New York” was edited to read “New York, United States, United Nations” for the sake of geotagging. “Nairobi, Kenya” became “Nairobi, Kenya,” with a trailing comma to account for what would later be the third location column. Locations that listed only a city or a country had commas placed accordingly: “Lebanon” became “, Lebanon,” and “Jerusalem” became “Jerusalem, ,”.
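The rules above can be sketched in code. This is a minimal Python illustration of the normalization logic, not the actual workflow (which was done by hand in a text editor and OpenRefine); the `COUNTRIES` set is an assumed helper for deciding whether a lone value is a city or a country, and the special case mirrors the “United Nations, New York” example from the text.

```python
# Illustrative country list; the real decision was made manually per entry.
COUNTRIES = {"Lebanon", "Kenya", "United States"}

def normalize_location(raw):
    """Rewrite a raw location string as a (city, country, location3) triple."""
    if raw.strip() == "United Nations, New York":
        # Special case from the text: UN Headquarters geotags as New York.
        return ("New York", "United States", "United Nations")
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if len(parts) == 2:
        return (parts[0], parts[1], "")
    if len(parts) == 1:
        if parts[0] in COUNTRIES:
            return ("", parts[0], "")   # country only, e.g. "Lebanon"
        return (parts[0], "", "")       # city only, e.g. "Jerusalem"
    return tuple(parts[:3])

print(normalize_location("Nairobi, Kenya"))  # ('Nairobi', 'Kenya', '')
print(normalize_location("Lebanon"))         # ('', 'Lebanon', '')
```

Once every entry has exactly three comma-separated parts, OpenRefine’s split-by-separator lands each part in the correct column.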

The same was done for the photograph credit. The UN Photo Library credit usually listed two to three credits, each separated by a slash: “Primary Organization/Secondary Organization/Photographer Name.” I standardized each entry to three credits, adding additional slashes for empty data points.
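Padding the credit string to a fixed number of fields is the same idea in miniature. A sketch of that step, with a hypothetical credit value:

```python
def pad_credits(raw, n=3):
    """Standardize a slash-separated credit string to exactly n fields,
    adding trailing slashes for empty data points."""
    parts = raw.split("/")
    parts += [""] * (n - len(parts))
    return "/".join(parts[:n])

# A two-credit entry gains a trailing slash for the empty third field:
print(pad_credits("Primary Organization/Photographer Name"))
# Primary Organization/Photographer Name/
```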

FileMaker Pro

In order to take into account overlapping search results from the four scrapes, I added a separate “keyword” column listing the search term that produced each record (disability, handicap, retard, cripple). This ensured that when the CSVs were consolidated into a single database on FileMaker, and repeated photograph entries were deleted, I could keep track of how many keywords each photograph showed up under. Furthermore, in order to differentiate between the WHO and UN scrapes, another column was added to indicate the originating organization. Once this process was complete, I imported all the files into FileMaker Pro.
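The tagging step can be sketched as follows. This is an assumed re-creation in Python (the actual columns were added during cleaning, before import); the column headers `Photo #` and `Title` are illustrative.

```python
import csv
import io

def tag_rows(csv_text, keyword, organization):
    """Add 'keyword' and 'organization' columns to each row of a scraped CSV,
    so the scrape's provenance survives consolidation."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["keyword"] = keyword
        row["organization"] = organization
    return rows

# Hypothetical scrape fragment tagged with its search term and source:
sample = "Photo #,Title\n123456,Example photo\n"
rows = tag_rows(sample, "disability", "UN")
print(rows[0]["keyword"], rows[0]["organization"])  # disability UN
```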

FileMaker allowed me to define relationships between images and keywords, and between images and WHO archival files on photographers and subjects. Furthermore, I could consolidate all the different CSV files in one spot without having to shift between programs to organize and clean data. This was crucial, as I was essentially creating the database for disability that the UN and WHO should have had on their websites.

The biggest problem I encountered was that I wasn’t getting all of the records from the imported CSV files; the import had done something unexpected. To remedy this, I created a validation rule on the “Photo #” field so that it was unique and always had to be validated during import, allowing existing records to be rejected on import. Moreover, by setting up a keyword field, I could ensure that photos had one-to-many relationships with keywords. This connection was based on Photo #.

Ultimately, I was left with a keyword field listing every keyword each photograph showed up under. The formatting of the keywords did pose a slight problem when the consolidated dataset was exported as a CSV: keywords were listed individually on separate rows, rather than in a single cell under the keyword column, separated by commas. To fix this, I created a new field in FileMaker that stored the keywords in a comma-separated format.
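The end result of the FileMaker consolidation, deduplicating on Photo # while accumulating keywords into a single comma-separated field, can be sketched outside FileMaker like this. The row dictionaries are hypothetical and reduced to the two relevant columns.

```python
def consolidate(rows):
    """Merge rows sharing a Photo #, collecting their keywords
    into one comma-separated string per photograph."""
    merged = {}
    for row in rows:
        pid = row["Photo #"]
        if pid in merged:
            existing = merged[pid]["keyword"].split(", ")
            if row["keyword"] not in existing:
                merged[pid]["keyword"] += ", " + row["keyword"]
        else:
            merged[pid] = dict(row)
    return list(merged.values())

rows = [
    {"Photo #": "123456", "keyword": "disability"},
    {"Photo #": "123456", "keyword": "handicap"},
    {"Photo #": "654321", "keyword": "cripple"},
]
for r in consolidate(rows):
    print(r["Photo #"], r["keyword"])
# 123456 disability, handicap
# 654321 cripple
```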

Final Clean

After an initial look through the consolidated data, issues with encoding were the most visually apparent. I had made sure to export the CSV from FileMaker in UTF-8, yet special characters such as accents and em-dashes were still rendered incorrectly. As a precaution, I opened the data in Google Sheets, since character encoding in Excel is known to be unreliable. The same issues persisted, however, indicating that the encoding errors had occurred during the scrape itself, as Web Scraper tried to handle inconsistencies on the website. I made the necessary edits using regular expressions in a text editor’s search-and-replace function, and copied the cleaned entries back into the CSV files.
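One common cause of this kind of garbling is UTF-8 text that was at some point decoded as Windows-1252, which turns “é” into “Ã©” and similar. Assuming that failure mode (the text does not specify which one occurred), the damage can often be reversed programmatically rather than pattern by pattern:

```python
def fix_mojibake(text):
    """Undo UTF-8 text mis-decoded as cp1252 (e.g. 'Ã©' back to 'é').
    Strings that don't fit the pattern are returned unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(fix_mojibake("RÃ©union"))  # Réunion
print(fix_mojibake("GenÃ¨ve"))   # Genève
```

Hand-built regex replacements, as described above, achieve the same end when the set of broken characters is small and known.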

There were also a few instances where the WHO had made bookkeeping errors: certain names were misspelled, and stray spaces were inserted between letters. For these, I used OpenRefine again, running a cluster-and-edit on the cells to keep names and spellings consistent.
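OpenRefine’s default clustering method keys each cell by a “fingerprint” so that near-duplicate spellings collapse into one group for review. An approximate Python re-creation of that keying, using a hypothetical name:

```python
import string
import unicodedata

def fingerprint(value):
    """Approximate OpenRefine's fingerprint key: lowercase, strip accents
    and punctuation, then sort and deduplicate the tokens."""
    value = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    value = "".join(c for c in unicodedata.normalize("NFKD", value)
                    if not unicodedata.combining(c))
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))

# Variant spellings of a (hypothetical) name collapse to the same key:
print(fingerprint("Jane  Doe"))   # doe jane
print(fingerprint("jane doe."))   # doe jane
```

Cells sharing a key are presented as a cluster, and the curator picks the canonical spelling; errors the key cannot see (such as a space inserted mid-word) still need a manual pass.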

Cleaned CSV datasets are provided here for download.