Arranging Images on the Command Line

I have been a Mac user for more than a decade, but lately I have started a journey towards replacing my Mac with a Linux laptop. This leads to a lot of small tasks that have to be carried out differently. One of them is getting images and videos from an iPad/iPhone to my laptop – without using iCloud.

This may not sound very data-science-like, but I had to go through some steps to get my files into the right folders: I sort my images by year and month, such that all images from e.g. March 2019 are in Pictures/2019/03.

I also love using the command line and this was a nice combo.

Get images from iPad

I ended up using the Air Transfer app that exposes a web interface to content on the iPad. From this interface I can download batches of images in zip files.

However, the images in the zip file all have the time of download as their creation time.


This has not been a problem when accessing the images from a Mac through a cable. Here the creation timestamp is retained when importing with e.g. Preview.

Get image creation date

Luckily, the Exif data in the image is intact. When installing ImageMagick on Ubuntu, we also get the program identify that can access Exif data. Run the following command to get all the Exif data:

identify -verbose <image file>

The entries related to creation time look like this:

exif:DateTime: 2019:01:12 16:11:02
exif:DateTimeDigitized: 2019:01:12 16:11:02
exif:DateTimeOriginal: 2019:01:12 16:11:02

To limit the output to specific fields (like the time of creation that I am looking for) we can write:

identify -format '%[EXIF:DateTime]' <image file>

The result is then of the form 2019:01:12 16:11:02.
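Since the goal is a year/month folder, the timestamp can already be split in the shell. The following is a minimal sketch, assuming the timestamp always has the form shown above; exif_to_folder is a hypothetical helper name and not part of ImageMagick.

```shell
#!/usr/bin/env sh
# Turn an Exif timestamp such as "2019:01:12 16:11:02" into "2019/01".
# exif_to_folder is a hypothetical helper, not an ImageMagick command.
exif_to_folder() {
    stamp=$1
    datepart=${stamp%% *}    # "2019:01:12" (drop the time of day)
    year=${datepart%%:*}     # "2019"
    rest=${datepart#*:}      # "01:12"
    month=${rest%%:*}        # "01"
    printf '%s/%s\n' "$year" "$month"
}

exif_to_folder "2019:01:12 16:11:02"   # prints 2019/01
```

This only uses POSIX parameter expansion, so no extra tools are needed. It does no validation: an empty or differently formatted string gives a meaningless result.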

To get this information for all files I use the following shell script:

#!/usr/bin/env sh

# Start the CSV file with a header line
echo "filename;creationtime" > datetime.csv

# Append one "filename;creationtime" line per JPG file
ls *.JPG | while read -r f; do
    creationdate=$(identify -format '%[EXIF:DateTime]' "$f")
    echo "$f;$creationdate" >> datetime.csv
done

This reads as follows:

The file datetime.csv is created/reset with the header line “filename;creationtime”.

All the JPG files are listed and read individually by the while. The benefit of using ls | while read f over for f in `ls` is that the former handles filenames with spaces (though that is not a problem here).
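To see the difference between the two loop styles, here is a small sketch that counts the iterations of each in a temporary folder with one filename containing a space. The function names and the demo files are made up for illustration.

```shell
#!/usr/bin/env sh
# Count loop iterations for two ways of walking over the files in a folder.
# Both functions are hypothetical demos; each takes a directory as argument.
count_with_for_ls() {
    n=0
    for f in $(ls "$1"); do        # word splitting breaks "my photo.JPG" in two
        n=$((n + 1))
    done
    echo "$n"
}

count_with_while_read() {
    ls "$1" | {
        n=0
        while IFS= read -r f; do   # one iteration per line, spaces preserved
            n=$((n + 1))
        done
        echo "$n"
    }
}

demo_dir=$(mktemp -d)
touch "$demo_dir/my photo.JPG" "$demo_dir/IMG_0001.JPG"
count_with_for_ls "$demo_dir"       # prints 3
count_with_while_read "$demo_dir"   # prints 2
rm -r "$demo_dir"
```

A plain glob, for f in *.JPG, also keeps spaces intact and avoids calling ls altogether.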

In the while loop the creation date is extracted and together with the filename this is added to the end of the file datetime.csv.

There are also PNG files on my iPad, but these do not have the creation date in the metadata.

Arrange images in folders

A bit of data munging is needed to achieve my goal. In particular, I have to handle those PNG files with unknown creation date. To this end, I use R. First I load in datetime.csv.

library(readr)  # read_delim(), cols(), col_character(), col_datetime()

pictures_folder <- file.path(Sys.getenv("HOME"), "Pictures", "unsorted")

creation_date <- read_delim(
    file.path(pictures_folder, "datetime.csv"),
    delim =  ";", escape_double = FALSE,
    col_types = cols(
        filename = col_character(),
        creationtime = col_datetime(format = "%Y:%m:%d %H:%M:%S")
    ),
    trim_ws = TRUE
)

The loaded tibble looks something like this:

> creation_date
# A tibble: 3 x 2
   filename     creationtime       
   <chr>        <dttm>             
 1 IMG_0763.JPG 2018-11-02 17:58:26
 2 IMG_0962.JPG 2018-12-24 16:57:31
 3 IMG_0963.JPG 2018-12-24 16:57:33

Then all unsorted files are read and the JPG files are enriched with the creationtime by joining with the creation_date tibble. From creationtime we can extract year and month:

library(magrittr)  # provides the %>% pipe

# pattern is a regular expression (not a glob): names ending in .JPG or .PNG
all_image_files <- dir(pictures_folder, pattern = "\\.(JPG|PNG)$", full.names = TRUE) %>%
    tibble::as_tibble() %>%
    dplyr::rename(frompath = value) %>%
    dplyr::mutate(
        filename = basename(frompath)
    ) %>%
    dplyr::left_join(creation_date, by = "filename") %>%
    dplyr::mutate(
        year = lubridate::year(creationtime),
        month = lubridate::month(creationtime)
    )

A section of the data that illustrates the left_join: The PNG files have no creationtime and therefore no year or month:

# A tibble: 6 x 5
  filename     creationtime         year month
  <chr>        <dttm>              <dbl> <dbl>
1 IMG_1197.JPG 2019-01-26 16:28:34  2019     1
2 IMG_1198.JPG 2019-01-26 16:28:37  2019     1
3 IMG_1199.PNG NA                     NA    NA
4 IMG_1200.PNG NA                     NA    NA
5 IMG_1201.JPG 2019-02-01 17:35:08  2019     2
6 IMG_1202.JPG 2019-02-01 17:35:10  2019     2

I choose to fill the NAs by “Last Observation Carried Forward”, that is, each NA is replaced by the last non-NA value above it. This is not possible when the first rows are NA, so na.rm = FALSE is used to keep those leading NAs in place and ensure that the columns keep the right length.
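For readers who prefer to stay on the command line, the same idea can be sketched with AWK. This is only an illustration, assuming ";"-separated lines where the literal string NA marks a missing value; locf is a hypothetical helper name. Like na.rm = FALSE, it leaves leading NAs untouched.

```shell
#!/usr/bin/env sh
# Last Observation Carried Forward on the second ";"-separated column.
# locf is a hypothetical helper that reads "name;value" lines from stdin.
locf() {
    awk -F';' -v OFS=';' '
        $2 != "NA" { last = $2 }                # remember the latest known value
        $2 == "NA" && last != "" { $2 = last }  # fill the gap from the row above
        { print }
    '
}

printf '%s\n' 'a.PNG;NA' 'b.JPG;2019' 'c.PNG;NA' 'd.JPG;2020' | locf
# prints:
# a.PNG;NA
# b.JPG;2019
# c.PNG;2019
# d.JPG;2020
```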

A final trick is that I want the months in my folders to have two digits such that the lexicographical ordering aligns with the time ordering. This is solved very neatly by str_pad.
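The effect of the padding on sorting is easy to check in the shell, where printf plays the role of str_pad. A minimal sketch, with pad_month as a hypothetical helper; the input is assumed to be a plain integer without a leading zero, since many printf implementations read a leading 0 as octal.

```shell
#!/usr/bin/env sh
# pad_month is a hypothetical helper: 3 becomes 03, 11 stays 11.
pad_month() {
    printf '%02d\n' "$1"
}

# Unpadded month numbers sort lexicographically as 1, 10, 2
printf '%s\n' 10 2 1 | sort

# Padded month numbers sort as 01, 02, 10
for m in 10 2 1; do pad_month "$m"; done | sort
```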

Now we can construct the new path and remove any rows whose path would be invalid because year or month is still NA.

image_location <- all_image_files %>%
    dplyr::mutate(
        # Carry the last known year/month forward; leading NAs are kept
        year = zoo::na.locf(year, na.rm = FALSE),
        month = zoo::na.locf(month, na.rm = FALSE) %>% stringr::str_pad(2, pad = "0"),
        # Target is Pictures/<year>/<month>, one level above the unsorted folder
        topath = file.path(dirname(pictures_folder), year, month, filename)
    ) %>%
    tidyr::drop_na(year, month) %>%
    dplyr::select(frompath, topath)

The files shown above now have the following paths (where frompath is stripped of the folder to fit better on the screen).

> image_location
# A tibble: 7 x 2
  frompath     topath
  <chr>        <chr>
1 IMG_1197.JPG /home/robert/Pictures/2019/01/IMG_1197.JPG
2 IMG_1198.JPG /home/robert/Pictures/2019/01/IMG_1198.JPG
3 IMG_1199.PNG /home/robert/Pictures/2019/01/IMG_1199.PNG
4 IMG_1200.PNG /home/robert/Pictures/2019/01/IMG_1200.PNG
5 IMG_1201.JPG /home/robert/Pictures/2019/02/IMG_1201.JPG
6 IMG_1202.JPG /home/robert/Pictures/2019/02/IMG_1202.JPG

This tibble can then be exported for the next step.

readr::write_delim(image_location, file.path(pictures_folder, "image_location.csv"), col_names = FALSE)

Run as a script

The commands can now be collected in a single script. But instead of starting R to run this script, we can turn the R script into a command line script by including an appropriate shebang in the first line:

#!/usr/bin/env Rscript

By saving the script as image_location.R and making it executable with

chmod u+x image_location.R

the script can be run with the command ./image_location.R.

Moving files

Finally the files can be moved based on the image_location.csv. I use another shell script that loops through the lines of image_location.csv, extracting the fromfile and tofile using AWK and then moving them.

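#!/usr/bin/env sh

# Each line of image_location.csv holds "frompath topath", separated by a space
while read -r l; do
	fromfile=$(echo "$l" | awk '{print $1}')
	tofile=$(echo "$l" | awk '{print $2}')

	# Create the year/month folder if it does not exist yet
	mkdir -p "$(dirname "$tofile")"
	mv "$fromfile" "$tofile"
done < image_location.csv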
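Before moving a whole batch of photos it can be reassuring to preview what would happen. The following is a sketch of a variant with a dry-run switch; move_listed_files and the --dry-run argument are hypothetical names of my own. It assumes one "frompath topath" pair per line, separated by a space, and that neither path contains whitespace.

```shell
#!/usr/bin/env sh
# move_listed_files is a hypothetical helper.
# $1: list file with "frompath topath" per line (no whitespace in the paths)
# $2: optional "--dry-run" to print the planned moves instead of doing them
move_listed_files() {
    list=$1
    dry_run=${2:-}
    while read -r from to; do
        if [ "$dry_run" = "--dry-run" ]; then
            printf 'would move %s -> %s\n' "$from" "$to"
        else
            mkdir -p "$(dirname "$to")"   # create year/month folders on demand
            mv "$from" "$to"
        fi
    done < "$list"
}

# Usage (assumes image_location.csv is in the current directory):
# move_listed_files image_location.csv --dry-run
# move_listed_files image_location.csv
```

Letting read split each line into two variables replaces the two AWK calls, at the cost of the same restriction: paths with spaces would need a different delimiter.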