Experiments in Synthetic Data

6 November, 2018

This post was written by Frank Longford, who worked with Forensic Architecture on a 6-week ASI data science fellowship.

We use machine learning at Forensic Architecture in a range of ways to enhance our data gathering capacity during open-source investigations.

In many investigative contexts, the abundance of potentially relevant data presents a significant challenge to constructing a case. To this end, we use machine learning tools where appropriate to channel investigative resources effectively when processing footage from YouTube, Vimeo, or other media sources.

One way to process incoming footage is to flag videos that likely contain objects of interest. Using methods in algorithmic classification, we can index media content and sort it accordingly. For example, we are often interested in determining which frames in a video contain military hardware.

Tanks are a distinct symbol of an escalation of violence and also a good indicator that a nation state may be involved in a regional conflict. They also have a distinctive shape and size, which renders them more detectable to algorithmic classification.

In order to train machine learning classifiers, a large dataset containing the object of interest is usually necessary. As an office of architects and animators, we have experience generating images using simulated objects and environments. The rest of this post details experiments we have been running to supplement a dataset of real images with synthetic counterparts.

Generating Tanks

When we lack a training dataset for a particular object of interest, we can supplement it by digitally modelling the object and ‘photographing’ it in a simulated environment. Parametrising the object and its environment allows for infinite variations of output data. Variation is an important characteristic of training data, to ensure that classifiers do not overfit.
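The idea of parametrising a scene can be sketched independently of any particular 3D package. In the Python sketch below, the parameter names and ranges are illustrative inventions, not C4D settings:

```python
import random

# Illustrative scene parameters and ranges (hypothetical, not C4D's names).
PARAM_RANGES = {
    "camera_height_m": (1.5, 50.0),
    "camera_distance_m": (10.0, 200.0),
    "sun_angle_deg": (0.0, 90.0),
    "tank_rotation_deg": (0.0, 360.0),
}

def sample_scene(rng=random):
    """Draw one random variation of the scene, ready to hand to a renderer."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

scene = sample_scene()  # a fresh combination each call
```

Each call yields a new combination of values, which is what gives a rendered dataset its variation.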

Pipeline Overview

Our pipeline is set up in MAXON’s Cinema 4D (C4D). This 3D modelling and animation software gives us control over animation parameters through its node-based editor, XPresso. The node-based approach to parametrising variables can be replicated in other software with similar pipelines (Blender, Unity, etc.). We have developed a modular file set-up in which we can easily interchange objects of interest to create synthetic datasets. For this case study we used a detailed tank model as the object of interest, and simulated it across both rural and urban environments. We then tested how render settings influenced the classification scores of the rendered images.

Environmental and Object Settings

[Image grid: example renders comparing basic vs embedded environments, and singular vs multiple tanks]

We have generated several variations of synthetic datasets, containing singular or multiple tanks in either a "basic" or an "embedded" environment. A basic environment has no urban or natural infrastructure other than a ground condition and a sky, whereas an embedded environment generates variation in the surrounding context. The table below outlines the combinations of these conditions that we used to test the efficacy of the rendered images.

Set   Singular   Multiple   Basic   Embedded
1     X                     X
2                X          X
3     X                             X
4                X                  X

These conditions are chosen to represent the range of ways in which a tank is likely to appear in a rural or urban context. The position of the camera is varied across the heights and distances from which a real photograph is likely to be taken. Throughout the camera's movement around the object, 2D images are rendered at regular intervals. Together, these images form a synthetic dataset.
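The regular-interval camera path can be sketched as points on an orbit around the object. The numbers below are illustrative, not the values used in our scenes:

```python
import math

def orbit_positions(n_frames, radius, height):
    """Camera positions at regular angular intervals around an object at the origin."""
    step = 2 * math.pi / n_frames
    return [(radius * math.cos(i * step), radius * math.sin(i * step), height)
            for i in range(n_frames)]

# One render every 10 degrees at a fixed height and distance.
frames = orbit_positions(36, radius=50.0, height=10.0)
```

Varying `radius` and `height` between passes, as described above, produces views from the range of plausible photographic positions.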

In order to evaluate whether a synthetic dataset is useful in a machine learning context, we tested the performance of an open-source CNN (Convolutional Neural Network) on its images.

To benchmark, we used a MobileNet, which has been pre-trained on the ImageNet dataset, and already includes a label for "tank" in its default 1000 classes. Further information about image classifiers and a good introduction to CNN architecture can be found here.
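Conceptually, the classifier turns an image into a probability distribution over its 1,000 ImageNet labels, and the "tank" entry of that distribution is the score we care about. A toy numpy sketch, with a made-up three-label vocabulary and made-up network outputs:

```python
import numpy as np

def softmax(logits):
    """Convert raw network outputs into a probability distribution."""
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

labels = ["tank", "jeep", "half_track"]  # toy vocabulary, not ImageNet's 1000 labels
logits = np.array([4.2, 1.1, 0.3])       # made-up outputs for one image
probs = softmax(logits)
tank_score = probs[labels.index("tank")]  # the single number we benchmark against
```

A real MobileNet produces a 1,000-entry distribution per image, but the read-out of a single label's probability works the same way.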

We wanted to see how C4D render settings might affect a classifier's performance on simulated images. To find out, we took a single real image of a tank that returned a strong prediction score (98.5%) from the classifier, and modelled a 3D replica of its contents and environmental conditions.

[Images: the real photograph alongside its synthetic replica]

The synthetic image was then rendered multiple times, each time varying the render settings. The rendered images were then classified using MobileNet. A brief overview of our findings is shown below. The classifier's score on the real image, 98.5%, is the benchmark for synthetic image scores.

[Render thumbnails omitted]

Renderer                 C4D Standard     C4D Standard     C4D Standard
Global Illumination      OFF              OFF              OFF
Ambient Occlusion        OFF              OFF              OFF
Anti-Aliasing            None             Best             Geometry
Render Time (per 1000)   7 hrs 13 mins    8 hrs 3 mins     7 hrs 30 mins
Classifier Score         72.43 %          98.91 %          99.07 %

[Render thumbnails omitted]

Renderer                 C4D Standard     C4D Standard     C4D Standard
Global Illumination      Physical-Sky     Default          Physical-Sky
Ambient Occlusion        OFF              OFF              Default
Anti-Aliasing            Geometry         Best             Best
Render Time (per 1000)   10 hrs           10 hrs 17 mins   13 hrs 20 mins
Classifier Score         97.27 %          97.90 %          99.21 %

[Render thumbnails omitted]

Renderer                 C4D Physical     C4D Physical     C4D Physical
Global Illumination      Default          OFF              Default
Ambient Occlusion        ON               ON               ON
Anti-Aliasing            PAL              Gauss            Catmull
Render Time (per 1000)   16 hrs 7 mins    15 hrs 16 mins   16 hrs 23 mins
Classifier Score         96.99 %          97.39 %          99.20 %

We are using tests like these to guide our process of rendering synthetic datasets. For a more comprehensive study, see the PDF attached at the top of this post, Synthetic Data Report.

Performance of the Pre-Trained Classifier (Metrics)

There were observable differences in classifier score when we altered either the image content or the render settings. Below are the metrics for each of sets 1 through 4, each rendered using its best render settings. (The table of crosses higher up in this post describes how each set differs.)

Set   Accuracy (%)   Precision (%)   Recall (%)   F1 (%)
1     88.1           100             75.3         85.9
2     88.0           100             75.9         86.3
3     82.2           94.3            67.2         78.5
4     75.6           93.5            54.9         69.2

It seems that the introduction of embedded environments (sets 3 and 4) introduces significant variation in MobileNet's performance. In particular, the recall scores for images of multiple tanks in an embedded environment are much lower than any other performance measure. This is significant, since recall falls as the number of false negative results, or tanks "missed" by the classifier, rises. Images containing multiple objects in varied environmental settings therefore seem to be much more challenging for our image classifier to process, most likely because they do not reflect the form of the real images that MobileNet was trained on.
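To make the relationship concrete: recall = TP / (TP + FN), so every missed tank (a false negative) pulls recall down directly, even while precision stays high. A minimal sketch with made-up confusion counts:

```python
def metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # falls as false negatives (missed tanks) rise
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1

# Illustrative counts: many missed tanks (fn) drag recall down
# while precision stays high, the pattern seen in set 4 above.
acc, prec, rec, f1 = metrics(tp=55, fp=4, fn=45, tn=96)
```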

This indicates that it might be possible to improve classifier performance by retraining the outer layers of the CNN with synthetic data sets such as 3 and 4.

Feature Score Clustering

Using synthetic data to train classifiers is prone to introducing unforeseen biases or to overfitting. We investigated the 'appropriateness' of each dataset by analysing it against the full set of feature scores MobileNet predicted (as opposed to focusing on just the 'tank' label).

We clustered the vectors returned from the classifier. Each data point represents an image, and the distribution of points reflects the similarity of the classifier's predicted content.

In the example below we trained a UMAP model on the feature vectors taken from a dataset of real images. The clear separation of images we know to contain tanks, shown in red at the top left, from those we know do not, shown everywhere else in blue, signifies good classifier performance.
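The plots here were made with UMAP; as a stand-in that shows the shape of the operation (projecting high-dimensional feature vectors down to one 2-D point per image), here is a plain PCA via numpy's SVD. PCA is not UMAP, but it illustrates the same dimensionality reduction step:

```python
import numpy as np

def project_2d(features):
    """Project (n_images, n_features) vectors to 2-D via PCA (a UMAP stand-in)."""
    centred = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T  # coordinates along the top two components

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 1000))  # stand-in for MobileNet feature vectors
points = project_2d(feats)            # shape (200, 2): one point per image
```

Each resulting 2-D point is then coloured by its known label to produce plots like those described below.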

Overlaying each synthetic dataset reveals the distribution of simulated images containing tanks in yellow and not containing tanks in cyan. An appropriate distribution would therefore show the yellow data points overlapping the red data points and the cyan data points overlapping the blue data points.

Colour   Label
Red      Real Tanks
Blue     Real Non-Tanks
Yellow   Synthetic Tanks
Cyan     Synthetic Non-Tanks

[UMAP plots for sets 1 to 4 omitted]

A quick look at each graph generally reveals good overlap between the synthetic and real images. However, in datasets 1 and 2 there is a slight systematic offset between the synthetic tank (yellow) images and the real tank (red) images, which may signify an artificial bias.
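Such a systematic offset could also be quantified rather than eyeballed, for example as the distance between the centroids of the real-tank and synthetic-tank point clouds. A rough numpy sketch on made-up 2-D points:

```python
import numpy as np

def centroid_offset(real_pts, synth_pts):
    """Distance between cluster centroids: 0 would mean perfectly aligned clouds."""
    return float(np.linalg.norm(real_pts.mean(axis=0) - synth_pts.mean(axis=0)))

rng = np.random.default_rng(1)
real = rng.normal(loc=[0.0, 0.0], size=(100, 2))
synth = rng.normal(loc=[0.5, 0.0], size=(100, 2))  # shifted cloud: a systematic bias
offset = centroid_offset(real, synth)              # roughly 0.5 for these clouds
```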

Interestingly, this bias appears to be reduced in the embedded datasets (3 and 4), and the overlap of synthetic and real images containing tanks becomes progressively more stochastic as multiple tanks or environmental embedding are introduced. This may signify that MobileNet has not been trained on images with similar content.

Retraining using images taken from datasets 3 or 4 is therefore likely to be more effective than using datasets 1 and 2.

Next Steps

The first and most concrete step for us is to attempt to use the datasets that we have rendered to train classifiers. If you are a data scientist or organisation that is interested in working with us on this problem, please do not hesitate to write to fadev@forensic-architecture.org. We are actively looking for partners to help us continue this research.

Here are just a few of the ideas that we have to improve our synthetic rendering pipeline:

  • Render datasets for other objects of interest. If you are a human rights organisation with a use case, or otherwise in need of a dataset, please do not hesitate to reach out to us.
  • Conduct further tests on the impact of camera sensor variations (motion blur and depth of field settings) and run more varied scenes through different render setting variations.
  • Adapt the scene to other modelling and animation software such as Blender or Unity, and test the impact of render settings across other renderers.
  • Experiment with post-processing the images to make them more appropriate for classifier use. For example, there are many interesting techniques using Generative Adversarial Networks (GANs) that could make rendered images more viable for training.
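As a far simpler post-processing step than a GAN, renders can be made less 'clean' by injecting sensor-like noise. A numpy sketch on a dummy image (the sigma value is an arbitrary choice):

```python
import numpy as np

def add_sensor_noise(image, sigma=8.0, seed=None):
    """Add Gaussian noise to an 8-bit image, clipping back to the valid range."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(float) + rng.normal(scale=sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

render = np.full((64, 64, 3), 128, dtype=np.uint8)  # dummy flat-grey 'render'
noisy = add_sensor_noise(render, seed=0)
```

Perturbations like this are one way to narrow the gap between pristine renders and the compressed, noisy footage classifiers see in practice.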


Note on Classification vs Detection

CNNs process images and return output probabilities that correspond to the labels they have been trained to detect. Whereas image classifiers return only a single probability per available label, object detectors can return more information about the multiple places an object could reside within a single image.

In the example below, the image classifier returns a single set of predictions, whereas the object detector identifies the location of each separate object, as well as returning its most likely prediction score:

Classifier           Detector
[image]              [image with bounding boxes]
dog: 95.0%           dog: 98.97%      [128,224,314,537]
bicycle: 5.2%        bicycle: 99.36%  [162,119,565,441]
truck: 0.8%          truck: 91.46%    [475,85,689,170]
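The difference in output shape can be written out as plain data structures, using the labels, scores and boxes from the example above (box coordinates in pixels):

```python
# Classifier: one probability per label for the whole image.
classifier_out = {"dog": 0.950, "bicycle": 0.052, "truck": 0.008}

# Detector: a list of (label, confidence, bounding box) per found object.
detector_out = [
    ("dog",     0.9897, (128, 224, 314, 537)),
    ("bicycle", 0.9936, (162, 119, 565, 441)),
    ("truck",   0.9146, (475,  85, 689, 170)),
]

# A detector can localise several objects; a classifier can only say
# which single label best describes the image as a whole.
best_label = max(classifier_out, key=classifier_out.get)
```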

For this study we defaulted to using image classifiers, since there is a larger number of open-source variants available that have already been pre-trained on images of tanks. This is largely thanks to the ImageNet 1000-class image recognition challenge including tanks as one of the requisite labels.