🌁🤖 Content similarity image clusters (with computer vision), annotated
Identifying thematic clusters inside a (rather large) image collection.
This approach helps to cluster and visualise the images in a collection according to how machine learning algorithms classify their content. It can be used to identify thematic visual clusters inside a collection of images, as well as to quantify them. It is similar to a co-hashtag analysis, but undertaken with visual content. In practice, one generates tags for each image with the help of computer vision and then uses shared tags to visually cluster similar images. There are four main phases. First, images are tagged with the help of a computer vision API. Second, images are downloaded and saved locally. Third, a network of images and tags is built and visualised in Gephi. Finally, images are loaded into the network and exported. The process ends with the annotation of clusters on Vectr.com.
🧱 Inputs from TCAT
- “Media frequency”
- Or “Export all tweets from selection” → column “from_user_profile_image_url”
Tag images with computer vision API
- Open data with Google Spreadsheet
- Export csv with URLs list from Google Spreadsheet
- Import the URLs list into the image tagging tool (be sure to get your own Clarifai API key and paste it into the tool)
- Run tagging (click button: process input file)
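If you prefer to prepare the URLs list with a script rather than a spreadsheet, the following is a minimal sketch in Python. It assumes the export is a csv file with a header row and that the URLs sit in the column “from_user_profile_image_url” mentioned above. The function name `extract_urls` and the sample rows are hypothetical, for illustration only.

```python
import csv
import io


def extract_urls(csv_text, column="from_user_profile_image_url"):
    """Return the unique, non-empty values of one column, in first-seen order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {}
    for row in reader:
        url = (row.get(column) or "").strip()
        if url:
            # dict keys keep insertion order, so duplicates are dropped
            # without reordering the list
            seen.setdefault(url, None)
    return list(seen)


# Hypothetical sample export; real exports have many more columns.
sample = (
    "id,from_user_profile_image_url\n"
    "1,https://example.org/img/a.jpg\n"
    "2,https://example.org/img/b.jpg\n"
    "3,https://example.org/img/a.jpg\n"
    "4,\n"
)

urls = extract_urls(sample)
print("\n".join(urls))
```

The printed list has one URL per line, which can be saved as the csv file to import into the tagging tool.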
Download images locally and resize them
You can do this, for example, using a browser extension or using the command line.
Using browser extension
- Install Tab Save Chrome extension
- Copy and paste URLs list in Tab Save
- Download images from URLs list with Tab Save
- Go to Bulk Resize Photos
- Drag images
- Resize by 50%
- Unzip folder
Using command line
An alternative approach to downloading images is to use the command line.
- First you’ll need to install wget, a tool for downloading files using HTTP and other protocols. How you do this will depend on your operating system, your command line interface and your package manager. For example,
- If you’re on a Mac you can install a package manager such as Homebrew and then use
$ brew install wget
- If you’re on a Debian-based Linux distribution such as Ubuntu you can use (prefix the command with sudo if you get a permissions error)
$ apt-get install wget
- Put the image URLs into a single csv file, with one URL per line and no header row (wget reads each line of the file as a URL)
- Create a folder where you’d like the images to go and navigate to the folder using the command line
- You can use
ls to list the files at your location and
cd to change to a given directory
- You can download the files listed in the csv file using the command
wget -i [path to the csv file] --show-progress
- For further details and other options see the wget manual
- If this works you should have a folder full of images. 🎏
- If you’d like to create a text file with the images which are in the folder (e.g. to check which have downloaded or to add file names to a dataset) you can use
ls -1 >> [name of your file.csv] to generate another csv file with the names of all the files which have successfully downloaded.
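To check which images have downloaded, you can also compare the URLs list with the contents of the folder. The following is a minimal sketch in Python, assuming that each file was saved under the last segment of its URL path and that the URLs carry no query strings (otherwise the saved names may differ). The function names and example URLs are hypothetical.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlsplit


def expected_name(url):
    """File name we assume the download was saved under: the last path segment."""
    return posixpath.basename(urlsplit(url).path)


def missing_downloads(urls, folder):
    """Return the URLs whose expected file name is not present in the folder."""
    present = set(os.listdir(folder))
    return [url for url in urls if expected_name(url) not in present]


# Hypothetical example: two URLs, only one of which has been "downloaded".
urls = ["https://example.org/img/a.jpg", "https://example.org/img/b.jpg"]
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "a.jpg"), "w").close()  # stand-in for a real image
    still_missing = missing_downloads(urls, folder)

print(still_missing)
```

In practice you would pass the path of your own images folder instead of the temporary one used here.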
Prepare edges table for Gephi
- Import csv output of Image tagging tool in Google spreadsheet
- Rename headers: url → Source; concept → Target; confidence → Weight
- Export edges csv from Google Spreadsheet
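The header renaming above can also be scripted. The following is a minimal sketch in Python, assuming the tagging output is a csv file with the columns url, concept and confidence named in the step above; any other columns would be dropped. The function name `to_edges_csv` and the sample rows are hypothetical.

```python
import csv
import io

# Mapping from the tagging tool's headers to the headers Gephi expects.
HEADER_MAP = {"url": "Source", "concept": "Target", "confidence": "Weight"}


def to_edges_csv(tagging_csv_text):
    """Rewrite a tagging csv as an edges table with Source, Target, Weight."""
    reader = csv.DictReader(io.StringIO(tagging_csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(HEADER_MAP.values()), lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow({new: row[old] for old, new in HEADER_MAP.items()})
    return out.getvalue()


# Hypothetical sample of tagging output.
sample = (
    "url,concept,confidence\n"
    "https://example.org/img/a.jpg,tree,0.98\n"
    "https://example.org/img/a.jpg,forest,0.91\n"
)

edges_csv = to_edges_csv(sample)
print(edges_csv)
```

Each row becomes one edge from an image (Source) to a tag (Target), weighted by the classifier's confidence.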
Prepare nodes table for Gephi
Note: you can also use Table2Net to create graph files from csv files, but here we will do this manually to update image locations.
- Copy “Source” column in a new sheet
- Rename column as “Id”
- Make a copy of “Id” column into a new column
- Rename the new column as “Image”
- In the new column “Image”, transform the URL strings into file names (see below for options)
- Export node csv from Google Spreadsheet
Adding image file names with find and replace
- Sort column “Image” alphabetically
- There are different types of URLs, but all of them end with the image name. The goal is to delete every character before the image name with the Find and Replace tool. Proceed one group of similar URLs at a time.
- Select column “Image”
- Edit → find and replace
- Find a string such as
https://pbs.twimg.com/media/ and replace with nothing (be sure that you are doing the search only in a “specific range”, which in this case is the column “Image”)
- Repeat for each type of URL, until you only have image names and no URLs
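The same transformation can be sketched in Python, which avoids repeating find and replace for each type of URL. This is a minimal sketch, assuming that the image name is the last segment of the URL path. The function names and the example file names are hypothetical. It builds one node row per unique URL, with the “Id” and “Image” columns described above.

```python
import posixpath
from urllib.parse import urlsplit


def image_name(url):
    """Strip everything before the image name, whatever the URL prefix is."""
    return posixpath.basename(urlsplit(url).path)


def nodes_rows(source_urls):
    """Build one node row per unique URL: Id is the URL, Image is the file name."""
    rows = []
    seen = set()
    for url in source_urls:
        if url not in seen:
            seen.add(url)
            rows.append({"Id": url, "Image": image_name(url)})
    return rows


# Hypothetical "Source" column values; the file names are made up.
sources = [
    "https://pbs.twimg.com/media/abc123.jpg",
    "https://pbs.twimg.com/media/abc123.jpg",
    "https://example.org/img/x.png",
]

for row in nodes_rows(sources):
    print(row["Id"], row["Image"])
```

Repeated URLs in the “Source” column collapse into a single node, since each image appears once per tag in the edges table.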
Adding file names with VLOOKUP
- Create new sheet called “urls” and copy and paste the image URLs into it.
- Create a new ‘Named range’ by clicking on the “Data” menu and then “Named ranges” then “Add a range” from the right hand menu, enter the name “url_list” and select the urls in the sheet and click “Ok” and “Done”
- Create new sheet called “images” and copy and paste the downloaded file names into it (if you used wget method above you can use
ls -1 >> [name of your file.csv] to get a list of the downloaded files)
- Use the VLOOKUP function to find the URLs associated with each of the images in the “images” sheet by using the following formula next to the first cell on the sheet
=VLOOKUP("*"&A1,url_list, 1,FALSE). You can double click the small square in the bottom of the cell or drag down to lookup the rest of the images in the sheet.
- Copy the table of images and associated urls and paste into a new sheet (e.g. “image_urls_values”) using “Edit” > “Paste special” > “Paste values only”. Create a new named range (using the same process described above) by selecting these values and naming them “image_url_table”.
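- In the “nodes” sheet, create a new column called “Image” and use VLOOKUP again to find the file name associated with each URL with the following formula. VLOOKUP searches the first column of the range, so make sure the URLs are in the first column of “image_url_table” and the file names in the second.
=VLOOKUP(A2,image_url_table, 2,FALSE). You can double click the small square in the bottom of the cell or drag down to look up the rest of the URLs in the sheet.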
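The two lookups above can be sketched in Python as a rough analogue of the spreadsheet formulas. This is a minimal sketch, not a replacement for the VLOOKUP steps: it assumes each downloaded file name appears at the end of exactly one URL, and unlike VLOOKUP its matching is case-sensitive. The function names and sample values are hypothetical.

```python
def match_files_to_urls(file_names, urls):
    """For each file name, find the first URL that ends with it.

    Roughly what VLOOKUP("*"&A1, url_list, 1, FALSE) does.
    """
    url_for_file = {}
    for name in file_names:
        for url in urls:
            if url.endswith(name):
                url_for_file[name] = url
                break
    return url_for_file


def image_column(node_ids, url_for_file):
    """For each node Id (a URL), return its file name, or "" if none was found."""
    file_for_url = {url: name for name, url in url_for_file.items()}
    return [file_for_url.get(node_id, "") for node_id in node_ids]


# Hypothetical data: two URLs, only one of which was downloaded.
urls = ["https://example.org/img/a.jpg", "https://example.org/img/b.jpg"]
files = ["a.jpg"]

table = match_files_to_urls(files, urls)
images = image_column(urls, table)
print(images)
```

Nodes whose image was not downloaded get an empty “Image” value, which makes gaps easy to spot before importing into Gephi.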
Import network and visualize clusters with Gephi
- Open Gephi
- Download and install “Image preview” plugin
- Data laboratory → import spreadsheet
- Import edges table
- Data laboratory → import spreadsheet
- Import nodes table (be sure to have checked: “append to existing workspace”)
- Resize nodes based on “out-degree”
- Spatialise network with Force Atlas 2
Export image from Gephi and annotate
- In the Finder, find one image file
- Find path (on Mac: command + i)
- Copy path
- Go to Gephi → Preview window
- Select “Render nodes as images”
- In the field “Image Path”: paste image path
- Set nodes opacity to 0
- Deselect “show edges”
- Click “Refresh” to generate image network
- Export png
- Import png in Vectr.com
- Annotate clusters
Note: if the images don’t show up and/or the nodes continue to appear behind the images you may have to turn opacity to 100 in “Preview Settings” of the “Preview” panel and then change the node colour to white by going to “Overview” > “Appearance” > “Nodes” > “Unique” and selecting white (#ffffff) as the node colour. Upon refreshing and exporting the network you should see just clusters of images, without nodes or edges.
🐙 Inspiration, acknowledgments and contributors
This and other visual methods recipes were originally formulated by Gabriele Colombo drawing on his doctoral work exploring the design of composite images. They were documented and refined for a module on Digital Methods for Internet Studies: Concepts, Devices and Data convened by Liliana Bounegru and Jonathan Gray at the Department of Digital Humanities, King’s College London, leading to a set of collaborative group projects with their students and the European Forest Institute. The approaches behind these recipes draw on several years of experimentation with images in the context of research and teaching at the Visual Methodologies Collective (Amsterdam University of Applied Sciences), the Digital Methods Initiative (University of Amsterdam), DensityDesign Lab (Politecnico di Milano), the médialab (Sciences Po, Paris) and beyond. You can read more about these approaches in Colombo, 2019 and Niederer & Colombo, 2019. Further readings can be found in the visual methods Zotero bibliography.