Quick Ways of Data Loading in TensorFlow
Most frameworks these days provide easy ways to load, preprocess, and pipeline data. Today, we will discuss various ways to load data using TensorFlow and Keras. Loading is the first step, followed by data augmentation and preprocessing.
Try the code yourself in this Colab.
1. image_dataset_from_directory
A high-level Keras preprocessing utility that reads a directory of images on disk. The data is expected to be in a directory structure where each subdirectory represents one class:
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
Calling image_dataset_from_directory(main_directory, labels='inferred') will return a tf.data.Dataset that yields batches of images from the subdirectories class_a and class_b, together with labels 0 and 1 (0 corresponding to class_a and 1 corresponding to class_b).
If there are more than two subdirectories, the labels are still inferred and run 0, 1, 2, 3, ..., one per subdirectory, since this then becomes a multi-class classification problem.
I found two ways to use this utility: tf.keras.utils.image_dataset_from_directory and tf.keras.preprocessing.image_dataset_from_directory.
import pathlib
import tensorflow as tf

batch_size = 32
img_height, img_width = 150, 150
seed = 42

# Download the raw data
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
                                   fname='flower_photos',
                                   untar=True)
data_dir = pathlib.Path(data_dir)

# Load data off disk using the Keras utility
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=seed,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=seed,
    image_size=(img_height, img_width),
    batch_size=batch_size)
# Found 3670 files belonging to 5 classes.
# Using 2936 files for training.
# Found 3670 files belonging to 5 classes.
# Using 734 files for validation.
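We can quickly verify what was loaded by inspecting the inferred class names and one batch. A minimal sketch, continuing from the train_ds created above:
class_names = train_ds.class_names  # inferred from the subdirectory names
print(class_names)

for images, labels in train_ds.take(1):
    print(images.shape)  # (32, 150, 150, 3): batch, height, width, channels
    print(labels.shape)  # (32,): one integer label per image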
2. tf.data
An API for building input pipelines when we want finer control; here we write our own pipeline with tf.data. We create the dataset by passing the directory contents to tf.data.Dataset.list_files, which expects a glob pattern to match.
import os
import pathlib

import numpy as np
import tensorflow as tf

img_height, img_width = 150, 150
AUTOTUNE = tf.data.AUTOTUNE

# Download the raw data
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
                                   fname='flower_photos',
                                   untar=True)
data_dir = pathlib.Path(data_dir)

# Total number of images
image_count = len(list(data_dir.glob('*/*.jpg')))

# List the files, then shuffle once so the train/validation split is random
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'), shuffle=False)
list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)

val_size = int(image_count * 0.2)
train_ds = list_ds.skip(val_size)
val_ds = list_ds.take(val_size)

class_names = np.array(sorted([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"]))

print("Using {} files for training.".format(len(train_ds)))
print("Using {} files for validation.".format(len(val_ds)))
def get_label(file_path):
    # Convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)
    # The second-to-last component is the class directory
    one_hot = parts[-2] == class_names
    # Integer-encode the label
    return tf.argmax(one_hot)

def decode_img(img):
    # Convert the compressed string to a 3D uint8 tensor
    img = tf.io.decode_jpeg(img, channels=3)
    # Resize the image to the desired size
    return tf.image.resize(img, [img_height, img_width])

def process_path(file_path):
    label = get_label(file_path)
    # Load the raw data from the file as a string
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label
# Use Dataset.map to create a dataset of image, label pairs:
# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
train_ds = train_ds.map(process_path, num_parallel_calls=AUTOTUNE)
val_ds = val_ds.map(process_path, num_parallel_calls=AUTOTUNE)
# Using 2936 files for training.
# Using 734 files for validation.
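The mapped datasets yield individual (image, label) pairs, so before training we would typically cache, shuffle, batch, and prefetch them. A minimal sketch continuing from the train_ds and val_ds above (a batch size of 32 is assumed here):
batch_size = 32

def configure_for_performance(ds):
    # Cache decoded images, shuffle, batch, and overlap data prep with training
    ds = ds.cache()
    ds = ds.shuffle(buffer_size=1000)
    ds = ds.batch(batch_size)
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds

train_ds = configure_for_performance(train_ds)
val_ds = configure_for_performance(val_ds)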
3. tensorflow_datasets
TensorFlow also provides a large catalog of easy-to-download datasets through tensorflow_datasets. Using tfds.load arguments such as split, we can choose which split(s) to read (e.g. 'train', ['train', 'test'], 'train[80%:]', ...).
import tensorflow as tf
import tensorflow_datasets as tfds
(train_ds, val_ds), info = tfds.load(
    'tf_flowers',
    split=['train[:80%]', 'train[80%:]'],
    with_info=True,
    as_supervised=True,
)
print("Using {} files for training.".format(len(train_ds)))
print("Using {} files for validation.".format(len(val_ds)))
# Using 2936 files for training.
# Using 734 files for validation.
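Note that tfds.load here returns unbatched (image, label) pairs with images of varying sizes, so they still need a resize and batch step before training. A minimal sketch, assuming the same 150x150 target size and batch size of 32 used earlier:
img_height, img_width = 150, 150
batch_size = 32
AUTOTUNE = tf.data.AUTOTUNE

print(info.features['label'].names)  # class names stored with the dataset

def resize_image(image, label):
    # Resize each image to a fixed size so the examples can be batched
    return tf.image.resize(image, [img_height, img_width]), label

train_ds = train_ds.map(resize_image, num_parallel_calls=AUTOTUNE).batch(batch_size).prefetch(AUTOTUNE)
val_ds = val_ds.map(resize_image, num_parallel_calls=AUTOTUNE).batch(batch_size).prefetch(AUTOTUNE)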
This is the easiest option of the three, but it covers a limited (though growing) number of datasets.
It may also happen that the raw data does not follow the directory format expected by the APIs above. In that case, we can rearrange it with Python modules such as shutil and then feed it to the TensorFlow APIs, as in the sketch below.
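For example, if all images sit in a single flat folder with the class encoded in the filename (a hypothetical layout; the folder names and naming scheme below are assumptions), a few lines of shutil are enough to sort them into the per-class subdirectories expected by image_dataset_from_directory:
import pathlib
import shutil

raw_dir = pathlib.Path('raw_images')        # assumed flat source folder, e.g. raw_images/daisy_001.jpg
main_dir = pathlib.Path('main_directory')   # target in the one-subdirectory-per-class format

for img_path in raw_dir.glob('*.jpg'):
    class_name = img_path.name.split('_')[0]           # class name taken from the filename prefix (assumption)
    class_dir = main_dir / class_name
    class_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(img_path, class_dir / img_path.name)   # or shutil.move to relocate instead of copying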
That’s it for today. We discussed how to load data using TensorFlow and Keras. I will be back with the next steps: how to do augmentation and preprocessing, and how to feed the input to a Model.
For an end-to-end Deep Learning flow, please visit GitHub.
Let us connect on LinkedIn and Twitter.