Creating a text generation neural network with TensorFlow

Lejdi Prifti
4 min read · Sep 23, 2023

In this series of articles, I will show you how to create and improve a neural network that produces text using TensorFlow.

The dataset we will use for training our neural network is the Wikipedia dataset from HuggingFace.

To download the dataset, you first need the apache-beam and datasets packages.

!pip install apache-beam
!pip install datasets

Then, we can download the Wikipedia dataset. We will use the Simple English version.

from datasets import load_dataset

dataset = load_dataset("wikipedia", "20220301.simple")

The dataset is a dictionary with a single split named train. This split is a Dataset object with four features: id, url, title, and text. We will be working with the text feature.

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 205328
})
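
A quick way to verify this structure yourself, using the dataset object we just loaded, is to print the splits, the feature schema, and one record:

# inspect the splits, the feature schema, and the first article's title
print(dataset)
print(dataset['train'].features)
print(dataset['train'][0]['title'])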

To get a feel for the data, you can use the following code. It picks a random entry from the training set and prints its text.

import random
random_choice = random.choice(dataset['train'])
print(f'{random_choice["text"]}')

Next, we need to prepare our training data. We will split the text into sentences and store them in a list.

First, we need to import nltk, a suite of libraries and programs for symbolic and statistical natural language processing written in Python.

import nltk
nltk.download('punkt')  # sentence tokenizer models used by sent_tokenize
from nltk.tokenize import sent_tokenize

Then, we split the text.

text = []
for data in dataset['train']:
    text.extend(sent_tokenize(data['text']))

We build the training data by taking the first 50,000 sentences from the list above.

training_data = text[:50000]
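
As a quick sanity check, we can look at how many sentences were extracted and what one of them looks like:

# sanity check: size of the full sentence list and of the training subset
print(len(text))
print(len(training_data))  # 50000
print(training_data[0])    # first sentence in the training data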

In this step, we will process the text. First, we create a Tokenizer object and fit it on our training data.

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(training_data)

After fitting the tokenizer, we can inspect its word index and compute the total number of words in the vocabulary.

total_words = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(total_words)

One of the most important tasks in text preprocessing is splitting the sentences into n-grams. Before doing that, we need to transform each sentence into a sequence of integers. Deep learning networks work with numbers and can't handle strings directly, so we transform each line of the training data into a sequence of integers.

Then, we split each sequence of integers into n-grams and store them in the input_sequences list.

input_sequences = []
for single_line in training_data:
    # transform each sentence into a sequence of integers
    token_list = tokenizer.texts_to_sequences([single_line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
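
To make the n-gram expansion concrete, here is a small sketch with a made-up sentence; the actual token ids depend on the fitted tokenizer:

# illustrate how one sentence expands into several n-gram prefixes
example = "april is the fourth month of the year"
example_ids = tokenizer.texts_to_sequences([example])[0]
for i in range(1, len(example_ids)):
    print(example_ids[:i + 1])  # prefixes of growing length: [w1, w2], [w1, w2, w3], ...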

Afterwards, we choose a maximum sequence length and pad the shorter sequences with zeros up to that length. Here we take the 75th percentile of the sequence lengths rather than the true maximum, so a few very long sequences don't blow up the padded matrix; anything longer gets truncated.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_sequence_len = np.int32(np.percentile([len(x) for x in input_sequences], 75))

# pad (and truncate) the input_sequences to max_sequence_len
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
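
You can inspect the padded matrix to confirm that every row now has the same length, with zeros on the left for the shorter sequences:

# every row has length max_sequence_len; shorter sequences are left-padded with zeros
print(max_sequence_len)
print(input_sequences.shape)
print(input_sequences[0])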

Now, it's time to build the dataset by creating the features and the labels. The label for each input sequence is its last integer, while the features are the remaining integers that precede it.

We use the tf.data.Dataset API to improve input-pipeline performance and train faster.

xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
# build the dataset with batches of 512 and autotuned prefetch
dataset = tf.data.Dataset.from_tensor_slices((xs, labels)).batch(512).prefetch(tf.data.AUTOTUNE)
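
To double-check that the pipeline yields what the model expects, we can peek at a single batch; the features should have shape (512, max_sequence_len - 1) and the labels shape (512,):

# take one batch and verify the feature and label shapes
for batch_xs, batch_labels in dataset.take(1):
    print(batch_xs.shape, batch_labels.shape)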

Let's build our first model. It contains an Input layer, an Embedding layer that transforms the tokens into dense embeddings, and two Bidirectional LSTM layers.

Finally, the output layer is a Dense layer with total_words neurons and a softmax activation function. We create the model from the inputs and outputs.

inputs = tf.keras.layers.Input(shape=(max_sequence_len - 1,))  # each feature row has max_sequence_len - 1 token ids

x = tf.keras.layers.Embedding(total_words, 128, input_length=max_sequence_len-1, mask_zero=True)(inputs)

x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)

outputs = tf.keras.layers.Dense(total_words, activation="softmax", name="output_layer")(x)

model_1 = tf.keras.models.Model(inputs, outputs)
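
Calling summary() is a convenient way to confirm the layer shapes and the parameter count before training:

# print the architecture: embedding -> two bidirectional LSTMs -> softmax over the vocabulary
model_1.summary()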

Since the labels are integer token ids, we use sparse categorical cross-entropy as our loss function. Additionally, I am choosing Adam as the optimizer and accuracy as the metric to evaluate on.

model_1.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                metrics=['accuracy'])

It’s time to train the model.

model_1.fit(dataset, epochs=5)
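
If you want to reuse the model later without retraining, one option (a small sketch, not something the rest of the article depends on) is to save both the model and the fitted tokenizer, since the same tokenizer is needed at generation time. Recent TensorFlow versions accept the .keras format; older ones can use .h5 instead.

import pickle

# save the trained model and the fitted tokenizer for later reuse
# (file names here are arbitrary choices)
model_1.save("text_gen_model_1.keras")
with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)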

When the training is over, we can use the following loop to generate some text. Each predicted word is appended to the seed text, which is then fed back in to predict the next word.

seed_text = "April is"
next_words = 10

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict the id of the next token
    predicted = np.argmax(model_1.predict(token_list), axis=-1)[0]
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)

Lejdi Prifti

Software Engineer @ Linfa | Building daily | Sharing insights