"Re-imaging the empirical" project at UNSW Art & Design
Re-imaging the empirical is a research project investigating the visual cultures of machine learning (ML). We are interested in the dominant role that images play in many contemporary ML projects and endeavours, from AlphaGo through to style transfer. We have been inquiring into how images are used by ML models and techniques as part of a broader re-contouring of what it is to both see and know the empirical world. In this project we use ML and dataset methods on material drawn from scientific scholarship – specifically the pre-print repository arXiv – to detect vectors and differences across scientific images: images that have themselves been generated by ML research in statistics, physics, mathematics, computer vision and more.
The project has three threads:
This GitHub repository contains the code used for the various parts of the project: downloading and extracting the bulk source data from arXiv, cleaning and organising metadata, converting images, visualising slices of the dataset, running classification algorithms, computing nearest-neighbour maps, and generating images with GANs. As such it is a large and varied repository, but we hope that some of the scripts and notebooks will be useful for other projects accessing arXiv or using ML techniques on large image datasets.
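arXiv distributes its bulk source files through a requester-pays Amazon S3 bucket. As a minimal sketch of the download step (the archive name below is illustrative, and the repository's own download scripts should be preferred), fetching one source archive with boto3 might look like this:

```python
# Sketch: download one bulk source archive from arXiv's requester-pays S3 bucket.
# The key name below is a hypothetical example; the bucket's manifest lists the
# actual archives. Note that requester-pays downloads incur AWS transfer charges.
import boto3

s3 = boto3.client("s3")

bucket = "arxiv"                       # arXiv's public requester-pays bucket
key = "src/arXiv_src_1811_001.tar"     # example archive name (hypothetical)

s3.download_file(
    Bucket=bucket,
    Key=key,
    Filename="arXiv_src_1811_001.tar",
    ExtraArgs={"RequestPayer": "requester"},  # required for requester-pays buckets
)
```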
The repository also contains statistics and images produced throughout the project; these materials are mostly concerned with looking at the dataset of all the images, text, and metadata contained within the arXiv source files.
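For illustration only, pulling the image files out of an extracted archive can be sketched as below; the set of extensions kept here is an assumption, and the real arXiv source archives nest per-article gzipped files that the repository's scripts unpack first.

```python
# Sketch: extract files with common image extensions from a tar archive.
# The extension list is an assumption; the dataset contains many formats
# (eps, ps, png, jpg, gif, ...) that the project handles in its own scripts.
import tarfile
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".eps", ".ps"}

def extract_images(archive_path: str, out_dir: str) -> int:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            if member.isfile() and Path(member.name).suffix.lower() in IMAGE_EXTENSIONS:
                tar.extract(member, path=out)   # keeps the member's internal path
                count += 1
    return count

if __name__ == "__main__":
    n = extract_images("arXiv_src_1811_001.tar", "extracted_images")
    print(f"extracted {n} image files")
```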
For detailed instructions on running the code, please look in the methods folder.
Code is written using bash, Python, SQLite, Jupyter notebooks, and Anaconda. It has been tested on Ubuntu 18.04 with an Intel CPU and an NVIDIA graphics card.
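The metadata lives in an SQLite database; the file name, table, and column names below are assumptions for illustration only, so check the repository's database scripts for the actual schema. A minimal query from Python might look like:

```python
# Sketch: query a metadata database from Python.
# The database file, table, and columns here are hypothetical stand-ins;
# see the repository's database-building scripts for the real schema.
import sqlite3

conn = sqlite3.connect("arxiv_metadata.sqlite3")
cursor = conn.execute(
    """
    SELECT cat, COUNT(*) AS n_articles
    FROM metadata
    GROUP BY cat
    ORDER BY n_articles DESC
    LIMIT 10
    """
)
for category, n_articles in cursor:
    print(f"{category}: {n_articles}")
conn.close()
```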
convert
- see the image-conversion folder and in particular convert_images_from_textfile_threaded.py

Montage of 144 images sampled randomly from the entire arXiv image dataset; images have been resized to fit within a 240x240 pixel square. Here we see a diverse collection of images.
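As a rough sketch of what the conversion step does (this is not convert_images_from_textfile_threaded.py itself, just an illustration), images listed in a text file can be resized in parallel so that each fits within a 240x240 pixel square:

```python
# Sketch: resize images listed in a text file (one path per line) so each fits
# within a 240x240 pixel square, using a thread pool. Illustrative only; it is
# not the repository's conversion script.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from PIL import Image

OUT_DIR = Path("converted")
OUT_DIR.mkdir(exist_ok=True)

def convert_one(path_str: str) -> str:
    path = Path(path_str)
    with Image.open(path) as im:
        im = im.convert("RGB")
        im.thumbnail((240, 240))          # resize in place, preserving aspect ratio
        out_path = OUT_DIR / (path.stem + ".jpg")
        im.save(out_path, "JPEG")
    return str(out_path)

if __name__ == "__main__":
    # "image_paths.txt" is a hypothetical list of image file paths
    paths = [p.strip() for p in Path("image_paths.txt").read_text().splitlines() if p.strip()]
    with ThreadPoolExecutor(max_workers=8) as pool:
        for out in pool.map(convert_one, paths):
            print("wrote", out)
```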
Stackplot of image file extensions for all arXiv preprint submissions by year.
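A stacked plot like this can be produced from per-year extension counts; the sketch below assumes a table with year, extension, and count columns (the file and column names are illustrative, not the project's actual outputs).

```python
# Sketch: stacked plot of image file-extension counts per year.
# The CSV and its columns (year, extension, count) are hypothetical stand-ins
# for counts derived from the metadata database.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("extension_counts_by_year.csv")   # columns: year, extension, count
pivot = df.pivot_table(index="year", columns="extension",
                       values="count", aggfunc="sum").fillna(0)

fig, ax = plt.subplots(figsize=(10, 5))
ax.stackplot(pivot.index, pivot.T.values, labels=pivot.columns)
ax.set_xlabel("year")
ax.set_ylabel("number of images")
ax.legend(loc="upper left", ncol=2)
fig.savefig("extensions_stackplot.png", dpi=150)
```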
Relative number of articles per arXiv primary category. Only categories with article counts > 1000 shown.
Percentage of articles published in a given category appearing in each year 1991-2018.
Number of images published per year in each category, ordered by the total number of images in a category (largest to smallest, top-left to bottom-right). Only the top 16 categories are shown here.
Average number of images per article across all arXiv categories and years of submission to 2018. The y-axis has been scaled to ignore outliers. Categories are arranged in alphabetical order; refer to arXiv for a list of categories: http://arxitics.com/help/categories.
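This kind of per-category, per-year average can be computed with a straightforward groupby; the sketch below assumes a hypothetical table with one row per article carrying its year, primary category, and image count.

```python
# Sketch: average number of images per article by category and year.
# The CSV and its columns (year, cat, n_images) are illustrative assumptions.
import pandas as pd

articles = pd.read_csv("articles_with_image_counts.csv")   # columns: year, cat, n_images

avg = (
    articles[articles["year"] <= 2018]
    .groupby(["cat", "year"])["n_images"]
    .mean()
    .unstack("year")        # categories as rows, years as columns
    .sort_index()           # alphabetical category order, as in the figure
)
print(avg.round(2).head())
```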
Ratio of diagram/sensor/mixed image classifications predicted using a custom ternary classifier. A maximum of 2000 images was sampled from any given category-year combination. The categories shown are hand-selected.
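The repository contains the actual training code for the ternary classifier; as a rough stand-in for how a three-class image classifier of this kind can be set up, the sketch below puts a new three-class head on a pretrained CNN (the ResNet-18 backbone and all settings here are assumptions, not necessarily the project's architecture).

```python
# Sketch: a three-class (diagram / sensor / mixed) image classifier built by
# replacing the head of a pretrained CNN. Backbone and hyperparameters are
# illustrative assumptions, not the project's custom classifier.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

CLASSES = ["diagram", "sensor", "mixed"]

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(CLASSES))   # replace 1000-class head
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def predict(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)                  # shape (1, 3, 224, 224)
    with torch.no_grad():
        logits = model(batch)
    return CLASSES[int(logits.argmax(dim=1))]

# Only after fine-tuning the new head on labelled examples would
# predict("some_figure.png") return a meaningful class label.
```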
t-SNE map of 1000 images from arXiv, organised by features extracted from a VGG classifier.
t-SNE map of arXiv images from 2012 with the primary category cs.CV (computer science, computer vision), organised by features extracted from a VGG classifier.
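Maps like these are built from image features embedded into two dimensions; a minimal sketch of extracting features with a pretrained VGG16 and running scikit-learn's t-SNE follows. Which VGG layer and t-SNE parameters the project used are not specified here, so both are assumptions.

```python
# Sketch: extract VGG16 features for a list of images and embed them in 2D with
# t-SNE. This takes the 4096-d output of the penultimate fully-connected layer;
# the project's actual feature layer and parameters may differ.
import numpy as np
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image
from sklearn.manifold import TSNE

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])  # drop final layer
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def features(paths):
    feats = []
    with torch.no_grad():
        for p in paths:
            img = Image.open(p).convert("RGB")
            feats.append(vgg(preprocess(img).unsqueeze(0)).squeeze(0).numpy())
    return np.stack(feats)

if __name__ == "__main__":
    # "tsne_image_paths.txt" is a hypothetical list of image file paths
    image_paths = [p for p in open("tsne_image_paths.txt").read().splitlines() if p]
    X = features(image_paths)
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
    print(coords.shape)   # (n_images, 2) positions for laying out the map
```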
This project has been supported by an Australian Research Council Discovery Grant.
Thank you to arXiv for use of its open access interoperability.
This project is licensed under the terms of the GPL licence; see GPL-3.0-or-later.