How to use Dataset in TensorFlow: the built-in input pipeline. Never use feed_dict again
Update 2/06/2018: Added a second full example that reads a csv file directly into a dataset.
feed_dict is the slowest possible way to pass information to TensorFlow and it must be avoided. The correct way to feed data into your models is to use an input pipeline, which ensures that the GPU never has to wait for new data to come in.
Luckily, TensorFlow has a built-in API, called Dataset, that makes it easier to accomplish this task. In this tutorial, we are going to see how we can create an input pipeline and how to feed the data into the model efficiently.
This article will explain the basic mechanics of the Dataset API, covering the most common use cases.
The complete notebook for this tutorial is available here:
https://github.com/FrancescoSaverioZuppichini/Tensorflow-Dataset-Tutorial/blob/master/dataset_tutorial.ipynb
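In order to use a Dataset we need three steps: import the data (create a Dataset instance from it), create an iterator, and consume the data. As a preview, here is a minimal end-to-end sketch of the three steps (assuming TensorFlow 1.x, with the usual np/tf import aliases):

```python
import numpy as np
import tensorflow as tf

# 1. importing data: create a Dataset instance from an in-memory array
x = np.random.sample((100, 2))
dataset = tf.data.Dataset.from_tensor_slices(x)

# 2. create an iterator over the dataset
iterator = dataset.make_one_shot_iterator()
el = iterator.get_next()

# 3. consume the data by evaluating the tensor in a session
with tf.Session() as sess:
    print(sess.run(el))  # prints the first row of x
```

Each step is covered in detail below.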
Importing Data

From numpy

This is the most common case: we have a numpy array and we want to pass it to TensorFlow.
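For instance, a minimal sketch:

```python
# create a Dataset from a (100, 2) numpy array of random samples
x = np.random.sample((100, 2))
dataset = tf.data.Dataset.from_tensor_slices(x)
```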
We can also pass more than one numpy array; one classic example is when we have data divided into features and labels.
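For example, a sketch with a (features, labels) tuple, which from_tensor_slices slices in parallel along the first dimension:

```python
# 100 examples, each with 2 features and 1 label
features, labels = (np.random.sample((100, 2)),
                    np.random.sample((100, 1)))
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
```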
From a placeholder

This is useful when we want to dynamically change the data inside the Dataset; we will see how later.
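A sketch of building a dataset on a placeholder; the actual data is supplied later, when the iterator is initialised:

```python
# the dataset is defined on a placeholder, so no data is bound yet
x = tf.placeholder(tf.float32, shape=[None, 2])
dataset = tf.data.Dataset.from_tensor_slices(x)
```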
From generator

We can also initialise a Dataset from a generator. This is useful when we have an array of elements with different lengths (e.g. a sequence):
In this case, you also need to specify the types and the shapes of your data, which will be used to create the correct tensors.
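A sketch of a generator-backed dataset; the toy variable-length sequence here is made up for illustration:

```python
# a toy sequence whose elements have different lengths
sequence = [np.array([1]), np.array([2, 3]), np.array([3, 4, 5])]

def generator():
    for el in sequence:
        yield el

dataset = tf.data.Dataset.from_generator(
    generator,
    output_types=tf.int64,                 # dtype must be declared up front
    output_shapes=tf.TensorShape([None]))  # per-element length is unknown

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    for _ in range(len(sequence)):
        print(sess.run(el))  # [1], then [2 3], then [3 4 5]
```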
From csv file

You can also read a csv file directly into a dataset. For example, suppose I have a csv file with tweets and their sentiment.
For this we can use tf.contrib.data.make_csv_dataset. Be aware that the iterator will create a dictionary whose keys are the column names and whose values are Tensors with the correct row values.
```python
# load a csv
CSV_PATH = './tweets.csv'
dataset = tf.contrib.data.make_csv_dataset(CSV_PATH, batch_size=32)
iter = dataset.make_one_shot_iterator()
next = iter.get_next()
print(next) # next is a dict with key=column names and value=column data
inputs, labels = next['text'], next['sentiment']

with tf.Session() as sess:
    sess.run([inputs, labels])
```

Here, next is:
```
{'sentiment': <tf.Tensor 'IteratorGetNext_15:0' shape=(?,) dtype=int32>, 'text': <tf.Tensor 'IteratorGetNext_15:1' shape=(?,) dtype=string>}
```

Create an Iterator

We have seen how to create a dataset, but how do we get our data back? We have to use an Iterator, which gives us the ability to iterate through the dataset and retrieve the real values of the data. There exist four types of iterators:
- One shot: it can iterate once through a dataset; you cannot feed any value to it.
- Initializable: you can dynamically change it by calling its initializer operation and passing the new data with feed_dict. It is basically a bucket that you can fill with stuff.
- Reinitializable: it can be initialised from different Datasets. Very useful when you have a training dataset that needs some additional transformation, e.g. shuffling, and a testing dataset. It is like using a tower crane to pick a different container.
- Feedable: it can be used to select which iterator to use. Following the previous example, it is like a tower crane that selects which tower crane to use to pick which container to take. In my opinion, it is useless.

One shot Iterator

This is the easiest iterator. Using the first example:
```python
x = np.random.sample((100,2))
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
# create the iterator
iter = dataset.make_one_shot_iterator()
```

Then you need to call get_next() to get the tensor that will contain your data:
```python
...
# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()
```

We can run el in order to see its value:
```python
with tf.Session() as sess:
    print(sess.run(el)) # output: [ 0.42116176  0.40666069]
```

Initializable Iterator

In case we want to build a dynamic dataset in which we can change the data source at runtime, we can use an initializable iterator. Using example three (the placeholder) from the previous section:
```python
# using a placeholder
x = tf.placeholder(tf.float32, shape=[None,2])
dataset = tf.data.Dataset.from_tensor_slices(x)

data = np.random.sample((100,2))

iter = dataset.make_initializable_iterator() # create the iterator
el = iter.get_next()

with tf.Session() as sess:
    # feed the placeholder with data
    sess.run(iter.initializer, feed_dict={ x: data })
    print(sess.run(el)) # output [ 0.52374458  0.71968478]
```

This time we call iter.initializer, the initializer operation, in order to pass our data, in this case a random numpy array.
Imagine now that we have a train set and a test set, a very common scenario:
```python
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.array([[1,2]]), np.array([[0]]))
```

We would then like to train the model and evaluate it on the test dataset; this can be done by initialising the iterator again after training:
```python
# initializable iterator to switch between datasets
EPOCHS = 10
x, y = tf.placeholder(tf.float32, shape=[None,2]), tf.placeholder(tf.float32, shape=[None,1])
dataset = tf.data.Dataset.from_tensor_slices((x, y))

train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.array([[1,2]]), np.array([[0]]))

iter = dataset.make_initializable_iterator()
features, labels = iter.get_next()

with tf.Session() as sess:
    # initialise iterator with train data
    sess.run(iter.initializer, feed_dict={ x: train_data[0], y: train_data[1]})
    for _ in range(EPOCHS):
        sess.run([features, labels])
    # switch to test data
    sess.run(iter.initializer, feed_dict={ x: test_data[0], y: test_data[1]})
    print(sess.run([features, labels]))
```

Reinitializable Iterator

The concept is similar to before: we want to switch dynamically between data. But instead of feeding new data to the same dataset, we switch datasets. As before, we want to have a train dataset and a test dataset:
```python
# making fake data using numpy
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((10,2)), np.random.sample((10,1)))
```

We can create two Datasets:
```python
# create two datasets, one for training and one for test
train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)
```

Now, this is the trick: we create a generic Iterator:
```python
# create an iterator of the correct shape and type
iter = tf.data.Iterator.from_structure(train_dataset.output_types,
                                       train_dataset.output_shapes)
```

and then two initialisation operations:
```python
# create the initialisation operations
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)
```

We get the next element as before:
```python
features, labels = iter.get_next()
```

Now we can directly run the two initialisation operations using our session. Putting it all together we get:
```python
# Reinitializable iterator to switch between Datasets
EPOCHS = 10
# making fake data using numpy
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((10,2)), np.random.sample((10,1)))
# create two datasets, one for training and one for test
train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)
# create an iterator of the correct shape and type
iter = tf.data.Iterator.from_structure(train_dataset.output_types,
                                       train_dataset.output_shapes)
features, labels = iter.get_next()
# create the initialisation operations
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)

with tf.Session() as sess:
    sess.run(train_init_op) # switch to train dataset
    for _ in range(EPOCHS):
        sess.run([features, labels])
    sess.run(test_init_op) # switch to test dataset
    print(sess.run([features, labels]))
```

Feedable Iterator

This is very similar to the reinitializable iterator, but instead of switching between datasets, it switches between iterators. After we create two datasets:
```python
train_dataset = tf.data.Dataset.from_tensor_slices((x,y))
test_dataset = tf.data.Dataset.from_tensor_slices((x,y))
```

we create the iterators; in this case we use the initializable iterator, but you can also use a one shot iterator:
```python
train_iterator = train_dataset.make_initializable_iterator()
test_iterator = test_dataset.make_initializable_iterator()
```

Now we need a handle, which will be our placeholder that can be dynamically changed:
```python
handle = tf.placeholder(tf.string, shape=[])
```

Then, similarly to before, we define a generic iterator using the shapes of the dataset:
```python
iter = tf.data.Iterator.from_string_handle(
    handle, train_dataset.output_types, train_dataset.output_shapes)
```

Then we get the next elements:
```python
next_elements = iter.get_next()
```

To switch between the two iterators, we run next_elements and pass the correct handle in the feed_dict. For example, to get one element from the train set:
```python
sess.run(next_elements, feed_dict = {handle: train_handle})
```

If you are using initializable iterators, as we are doing here, just remember to initialise them before starting:
```python
sess.run(train_iterator.initializer, feed_dict={ x: train_data[0], y: train_data[1]})
sess.run(test_iterator.initializer, feed_dict={ x: test_data[0], y: test_data[1]})
```

Putting it all together we get:
(See also the official guide on creating iterators: https://www.tensorflow.org/programmers_guide/datasets#creating_an_iterator)

```python
handle = tf.placeholder(tf.string, shape=[])
iter = tf.data.Iterator.from_string_handle(
    handle, train_dataset.output_types, train_dataset.output_shapes)
next_elements = iter.get_next()

with tf.Session() as sess:
    train_handle = sess.run(train_iterator.string_handle())
    test_handle = sess.run(test_iterator.string_handle())
    # initialise iterators
    sess.run(train_iterator.initializer, feed_dict={ x: train_data[0], y: train_data[1]})
    sess.run(test_iterator.initializer, feed_dict={ x: test_data[0], y: test_data[1]})
    for _ in range(EPOCHS):
        x, y = sess.run(next_elements, feed_dict = {handle: train_handle})
        print(x, y)
    x, y = sess.run(next_elements, feed_dict = {handle: test_handle})
    print(x, y)
```

Consuming data

In the previous examples we used the session to print the value of the next element in the Dataset:
```python
...
next_el = iter.get_next()
...
print(sess.run(next_el)) # will output the current element
```

In order to pass the data to a model, we just use the tensors generated by get_next() directly as inputs.
In the following snippet we have a Dataset that holds two numpy arrays, using the same example as in the first section. Notice that we need to wrap the np.random.sample output in another numpy array to add a dimension that is needed to batch the data:
```python
# using two numpy arrays
features, labels = (np.array([np.random.sample((100,2))]),
                    np.array([np.random.sample((100,1))]))
dataset = tf.data.Dataset.from_tensor_slices((features,labels)).repeat().batch(BATCH_SIZE)
```

Then, as always, we create an iterator:
```python
iter = dataset.make_one_shot_iterator()
x, y = iter.get_next()
```

We make a model, a simple neural network:
```python
# make a simple model
net = tf.layers.dense(x, 8) # pass the first value from iter.get_next() as input
net = tf.layers.dense(net, 8)
prediction = tf.layers.dense(net, 1)
loss = tf.losses.mean_squared_error(prediction, y) # pass the second value from iter.get_next() as label
train_op = tf.train.AdamOptimizer().minimize(loss)
```

We directly use the tensors from iter.get_next() as the input to the first layer and as the labels for the loss function. Wrapping it all together:
```python
EPOCHS = 10
BATCH_SIZE = 16
# using two numpy arrays
features, labels = (np.array([np.random.sample((100,2))]),
                    np.array([np.random.sample((100,1))]))
dataset = tf.data.Dataset.from_tensor_slices((features,labels)).repeat().batch(BATCH_SIZE)

iter = dataset.make_one_shot_iterator()
x, y = iter.get_next()

# make a simple model
net = tf.layers.dense(x, 8, activation=tf.tanh) # pass the first value from iter.get_next() as input
net = tf.layers.dense(net, 8, activation=tf.tanh)
prediction = tf.layers.dense(net, 1, activation=tf.tanh)
loss = tf.losses.mean_squared_error(prediction, y) # pass the second value from iter.get_next() as label
train_op = tf.train.AdamOptimizer().minimize(loss)

# run the training loop
with tf.Session() as sess:
    for i in range(EPOCHS):
        _, loss_value = sess.run([train_op, loss])
        print("Iter: {}, Loss: {:.4f}".format(i, loss_value))
```

Output:
```
Iter: 0, Loss: 0.1328
Iter: 1, Loss: 0.1312
Iter: 2, Loss: 0.1296
Iter: 3, Loss: 0.1281
Iter: 4, Loss: 0.1267
Iter: 5, Loss: 0.1254
Iter: 6, Loss: 0.1242
Iter: 7, Loss: 0.1231
Iter: 8, Loss: 0.1220
Iter: 9, Loss: 0.1210
```

Useful Stuff

Batch

We can use the method batch(BATCH_SIZE), which automatically batches the dataset with the provided size. The default value is one. In the following example, we use a batch size of 4:
```python
# BATCHING
BATCH_SIZE = 4
x = np.random.sample((100,2))
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x).batch(BATCH_SIZE)

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))
```

Output:
```
[[ 0.65686128  0.99373963]
 [ 0.69690451  0.32446826]
 [ 0.57148422  0.68688242]
 [ 0.20335116  0.82473219]]
```

Repeat

Using .repeat() we can specify the number of times we want the dataset to be iterated. If no parameter is passed, it will loop forever; usually it is good to just loop forever and directly control the number of epochs with a standard loop.
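A small sketch of the difference (values arbitrary):

```python
x = np.array([[1], [2]])
# repeat(3): iterate over the 2 elements three times, then raise tf.errors.OutOfRangeError
dataset = tf.data.Dataset.from_tensor_slices(x).repeat(3)
# dataset = tf.data.Dataset.from_tensor_slices(x).repeat()  # would loop forever instead

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    for _ in range(6):  # 2 elements x 3 repetitions
        print(sess.run(el))
```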
Shuffle

We can shuffle the Dataset by using the method shuffle(), which by default shuffles the dataset every epoch.
Remember: shuffling the dataset is very important to avoid overfitting.
We can also set the parameter buffer_size, a fixed-size buffer from which the next element is uniformly chosen. Example:
```python
# SHUFFLING
BATCH_SIZE = 4
x = np.array([[1],[2],[3],[4]])
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(BATCH_SIZE)

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))
```

First run output:
```
[[4]
 [2]
 [3]
 [1]]
```

Second run output:
```
[[3]
 [1]
 [2]
 [4]]
```

The order changes on every run; if you need a reproducible order, you can fix it with the seed parameter.
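For example (the seed value is arbitrary):

```python
# fixing the seed makes the shuffle order reproducible across runs
dataset = tf.data.Dataset.from_tensor_slices(x).shuffle(buffer_size=100, seed=0)
```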
Map

You can apply a custom function to each member of a dataset using the map method. In the following example we multiply each element by two:
```python
# MAP
x = np.array([[1],[2],[3],[4]])
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.map(lambda x: x*2)

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    for _ in range(len(x)):
        print(sess.run(el))
```

Output:
```
[2]
[4]
[6]
[8]
```

Full example: Initializable iterator

Below is a full working example that trains a model and then switches between the train and test sets using an Initializable iterator:
```python
# Wrapping it all together -> switch between train and test set using an initializable iterator
EPOCHS = 10
BATCH_SIZE = 16  # assumed value, used to feed the batch_size placeholder during training
# create a placeholder to dynamically switch between batch sizes
batch_size = tf.placeholder(tf.int64)

x, y = tf.placeholder(tf.float32, shape=[None,2]), tf.placeholder(tf.float32, shape=[None,1])
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size).repeat()

# using two numpy arrays
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((20,2)), np.random.sample((20,1)))

iter = dataset.make_initializable_iterator()
features, labels = iter.get_next()

# make a simple model
net = tf.layers.dense(features, 8, activation=tf.tanh) # pass the first value from iter.get_next() as input
net = tf.layers.dense(net, 8, activation=tf.tanh)
prediction = tf.layers.dense(net, 1, activation=tf.tanh)
loss = tf.losses.mean_squared_error(prediction, labels) # pass the second value from iter.get_next() as label
train_op = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    # initialise iterator with train data
    sess.run(iter.initializer, feed_dict={ x: train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
    print('Training...')
    for i in range(EPOCHS):
        _, loss_value = sess.run([train_op, loss])
        print("Iter: {}, Loss: {:.4f}".format(i, loss_value))
    # initialise iterator with test data
    sess.run(iter.initializer, feed_dict={ x: test_data[0], y: test_data[1], batch_size: test_data[0].shape[0]})
    print('Test Loss: {:4f}'.format(sess.run(loss)))
```

Notice that we use a placeholder for the batch size in order to dynamically switch it after training: we train with batches of BATCH_SIZE and then evaluate the whole test set in a single batch.
Output:
```
Training...
Iter: 0, Loss: 0.2977
Iter: 1, Loss: 0.2152
Iter: 2, Loss: 0.1787
Iter: 3, Loss: 0.1597
Iter: 4, Loss: 0.1277
Iter: 5, Loss: 0.1334
Iter: 6, Loss: 0.1000
Iter: 7, Loss: 0.1154
Iter: 8, Loss: 0.0989
Iter: 9, Loss: 0.0948
Test Loss: 0.082150
```

Full example: Reinitializable Iterator

The same scenario, this time switching datasets with a Reinitializable iterator:
```python
# Wrapping it all together -> switch between train and test set using a reinitializable iterator
EPOCHS = 10
BATCH_SIZE = 16  # assumed value, used to feed the batch_size placeholder during training
# create a placeholder to dynamically switch between batch sizes
batch_size = tf.placeholder(tf.int64)

x, y = tf.placeholder(tf.float32, shape=[None,2]), tf.placeholder(tf.float32, shape=[None,1])
train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(batch_size).repeat()
test_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(batch_size) # always batch, even if you want to one-shot it

# using two numpy arrays
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((20,2)), np.random.sample((20,1)))

# create an iterator of the correct shape and type
iter = tf.data.Iterator.from_structure(train_dataset.output_types,
                                       train_dataset.output_shapes)
features, labels = iter.get_next()

# create the initialisation operations
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)

# make a simple model
net = tf.layers.dense(features, 8, activation=tf.tanh) # pass the first value from iter.get_next() as input
net = tf.layers.dense(net, 8, activation=tf.tanh)
prediction = tf.layers.dense(net, 1, activation=tf.tanh)
loss = tf.losses.mean_squared_error(prediction, labels) # pass the second value from iter.get_next() as label
train_op = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    # initialise iterator with train data
    sess.run(train_init_op, feed_dict={ x: train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
    print('Training...')
    for i in range(EPOCHS):
        _, loss_value = sess.run([train_op, loss])
        print("Iter: {}, Loss: {:.4f}".format(i, loss_value))
    # initialise iterator with test data
    sess.run(test_init_op, feed_dict={ x: test_data[0], y: test_data[1], batch_size: len(test_data[0])})
    print('Test Loss: {:4f}'.format(sess.run(loss)))
```

Other resources

TensorFlow Dataset tutorial:
https://www.tensorflow.org/programmers_guide/datasets
Dataset docs:
https://www.tensorflow.org/api_docs/python/tf/data/Dataset
Conclusion

The Dataset API gives us a fast and robust way to create optimized input pipelines to train, evaluate, and test our models. In this article, we have seen most of the common operations we can perform with them.
You can use the jupyter notebook that I have made for this article as a reference.
Thank you for reading,
Francesco Saverio

Source: https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428