Generating Shayari using custom GPT-3

Ismail Siddiqui
5 min read · Jul 24, 2023


In this blog post, we will delve into the technicalities of generating Shayari using a smaller, open-source variant of OpenAI's GPT-3 architecture and the Hugging Face Transformers library. Shayari is a form of poetry that originated in the Indian subcontinent. It is a beautiful and expressive way of conveying emotions and thoughts.

Introduction

GPT-3, or Generative Pre-trained Transformer 3, is a state-of-the-art autoregressive language model that uses deep learning to produce human-like text. It's the third iteration of the GPT series developed by OpenAI and has 175 billion parameters. However, for this project, we will be using a much smaller, openly available model that follows the same architecture (a GPT-Neo checkpoint), which is more accessible and easier to manage.

Hugging Face is a company that created the Transformers library, which provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, summarization, translation, text generation, etc. It’s a very popular library in the NLP (Natural Language Processing) community.
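As a quick illustration of how little code this takes, a pretrained model from the Hub can be loaded and run in a few lines (the gpt2 checkpoint below is only a placeholder for demonstration, not the model we fine-tune later in this post):

from transformers import pipeline

# Download a pretrained text-generation model from the Hugging Face Hub and run it
generator = pipeline("text-generation", model="gpt2")
print(generator("Dil se nikli baat", max_length=20)[0]["generated_text"])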

Setting Up the Environment

First, we need to install the necessary libraries. You can install the Hugging Face Transformers and tokenizers library using pip:

pip install transformers tokenizers

Tokenizer Training from Scratch

While the pre-trained GPT-2 tokenizer is a powerful tool, there may be instances where you need to train a tokenizer from scratch. That is the case here because the input data is in Hinglish (Hindi written in the Roman script), which an English-centric tokenizer segments poorly. Hugging Face provides the Tokenizers library, which allows you to train a tokenizer from scratch. Here's how you can do it:

Training the Tokenizer

Next, you can train the tokenizer. For this, you need the full Shayari corpus as plain text; in the snippet below it is stored in data.txt.

from tokenizers import ByteLevelBPETokenizer

# Initialize a byte-level BPE tokenizer
tokenizer = ByteLevelBPETokenizer()

# Train the tokenizer on the Hinglish Shayari corpus
tokenizer.train(
    files=["data.txt"],
    vocab_size=50000,
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",  # This is the EOS token
        "<unk>",
        "<mask>",
    ],
)

# Save the vocabulary and merges files
tokenizer.save_model("hinglish_tokenizer")
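The training code further below expects a tokenized_dataset object. The exact preprocessing is not shown in the original snippets, but here is a minimal sketch of one way to build it with the Hugging Face datasets library, assuming the whole corpus sits in data.txt with one Shayari per line:

from datasets import load_dataset
from transformers import GPT2Tokenizer

# Load the tokenizer we just trained (save_model wrote vocab.json and merges.txt)
tokenizer = GPT2Tokenizer.from_pretrained("hinglish_tokenizer", pad_token="<pad>")

# Read the raw corpus; each line of data.txt is treated as one example
raw_dataset = load_dataset("text", data_files={"train": "data.txt"})["train"]

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, max_length=128, padding="max_length")
    # For causal language modelling, the labels are simply the input ids
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])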

Loading the Model

After installing the necessary libraries, we can load the model. We will use the GPTNeoForCausalLM and GPT2Tokenizer classes from the transformers library. GPT-Neo is an open-source model that follows the GPT-3 architecture, and it is the "smaller version of GPT-3" used throughout this post.
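One detail worth flagging before training: because we trained a brand-new 50,000-token vocabulary, the pretrained checkpoint's embedding matrix should be resized to match the new tokenizer before fine-tuning. A minimal sketch (this step is not shown in the original post):

from transformers import GPT2Tokenizer, GPTNeoForCausalLM

tokenizer = GPT2Tokenizer.from_pretrained("hinglish_tokenizer", pad_token="<pad>")
model = GPTNeoForCausalLM.from_pretrained("minhtoan/gpt3-small-finetune-cnndaily-news")

# Resize the input (and tied output) embeddings to the size of the new vocabulary
model.resize_token_embeddings(len(tokenizer))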

Training the Model

Now we can fine-tune our model on the Shayari dataset, using a learning rate of 0.001.

from transformers import Trainer, TrainingArguments
from transformers import GPTNeoForCausalLM

# Load the pretrained GPT-Neo checkpoint
model = GPTNeoForCausalLM.from_pretrained('minhtoan/gpt3-small-finetune-cnndaily-news')

num_epochs = 300
steps_per_epoch = 100
total_steps = num_epochs * steps_per_epoch

training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=32,
    weight_decay=0.00001,
    learning_rate=0.001,
    logging_dir="logs",
    max_steps=total_steps,
    fp16=True,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized_dataset, eval_dataset=None)
trainer.train()

In the above code, we configure training for 300 epochs, capped at 30,000 optimizer steps via max_steps. An epoch is a complete pass through the entire dataset. The loss function measures the difference between the model's predictions and the actual values, and the optimizer updates the model's parameters to minimize that loss.

The Mathematics Behind the Model

The GPT-3 model is based on the Transformer architecture, which uses self-attention mechanisms. Self-attention allows the model to weigh the importance of words in a sentence. For example, in the line "ek hunar hai jo kar gaya hu main sab ke dil se utar gaya hu main sun raha hu ghar gaya hu mai", the word "hunar" carries more weight than the others, and the self-attention mechanism allows the model to capture this.

The self-attention mechanism is calculated using the following formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Where Q, K, and V are the query, key, and value vectors, and d_k is the dimension of the key vector. The softmax function ensures that the weights sum to 1, and the dot product between the query and key vectors determines the weight of each word.
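To make the formula concrete, here is a small PyTorch sketch of scaled dot-product attention; the tensor shapes are toy values for illustration, not the actual GPT-Neo implementation:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # dot products, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)             # each row of weights sums to 1
    return weights @ V                              # weighted sum of the value vectors

# Toy example: one sentence, 4 tokens, d_k = 8
x = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 4, 8])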

The GPT-3 model also uses positional encoding to capture the order of the words in a sentence. The positional encoding is added to the word embeddings and has the same dimension, allowing the model to learn the importance of word order.
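As a rough sketch of the idea (GPT-style models learn these position embeddings during training; the dimensions below are illustrative only):

import torch
import torch.nn as nn

d_model, vocab_size, max_len = 768, 50000, 128

token_emb = nn.Embedding(vocab_size, d_model)  # word embeddings
pos_emb = nn.Embedding(max_len, d_model)       # learned positional embeddings

input_ids = torch.randint(0, vocab_size, (1, 10))          # a batch with 10 tokens
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # [[0, 1, ..., 9]]

# Both embeddings have the same dimension, so they can simply be added together
hidden = token_emb(input_ids) + pos_emb(positions)
print(hidden.shape)  # torch.Size([1, 10, 768])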

The loss function used in the training process is the cross-entropy loss, which is calculated using the following formula:

L = −Σᵢ yᵢ log(ŷᵢ)

Where y is the actual (one-hot) value and ŷ is the predicted probability. The cross-entropy loss measures the dissimilarity between the actual and predicted probability distributions.
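For intuition, the same loss can be computed on a toy batch with PyTorch (the vocabulary size and target ids below are made up for illustration):

import torch
import torch.nn.functional as F

# Logits for 3 positions over a toy vocabulary of 5 tokens (y_hat before softmax)
logits = torch.randn(3, 5)
targets = torch.tensor([1, 0, 4])  # the actual next-token ids (y)

# Equivalent to -sum(y * log(softmax(logits))), averaged over the 3 positions
loss = F.cross_entropy(logits, targets)
print(loss.item())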

Generating Shayari

Now that we have our fine-tuned model, we can generate Shayari. We will use the generate method of the model to generate text. We will also use the decode method of the tokenizer to convert the generated tokens back into text.

import re

from transformers import GPT2Tokenizer, GPTNeoForCausalLM

# Load the fine-tuned checkpoint and the custom Hinglish tokenizer
model = GPTNeoForCausalLM.from_pretrained("results/checkpoint-50000/")
tokenizer = GPT2Tokenizer.from_pretrained("hinglish_tokenizer", pad_token="<pad>")

prompt = "jaan"
inputs = tokenizer.encode(prompt, return_tensors='pt')

output = model.generate(inputs, max_length=25, num_return_sequences=1,
                        temperature=1, do_sample=True)

for sequence in output:
    sequence = sequence.tolist()
    text = tokenizer.decode(sequence, clean_up_tokenization_spaces=True)
    text = re.sub('<pad>', '', text).strip()
    text = re.sub(r'\s+', ' ', text)
    print(text)

# Output:
# jaan khapate hain gham e ishq men khush khush ali kaisi lazzat ka ye azar banaya hua hai

In the above code, prompt is the initial text that the model uses to generate the rest of the Shayari. The max_length parameter specifies the maximum length of the generated sequence in tokens. The temperature parameter controls the randomness of the output: a lower value makes the output more deterministic, while a higher value makes it more diverse.
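To see the effect of temperature for yourself, you can reuse the model, tokenizer, and inputs from the snippet above and sample at a few different values (an illustrative loop, not part of the original post):

# Lower temperature -> safer, more repetitive lines; higher -> more surprising ones
for temp in (0.7, 1.0, 1.3):
    output = model.generate(inputs, max_length=25, do_sample=True, temperature=temp,
                            pad_token_id=tokenizer.pad_token_id)
    print(temp, tokenizer.decode(output[0], skip_special_tokens=True))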

Results

Here are some results generated using the prompts "zindagi", "ishq", and "bewafa":

Conclusion

In this blog, we delved into the technicalities of generating Shayari using a small GPT-3-style model and the Hugging Face Transformers library. We discussed how to set up the environment, train a custom Hinglish tokenizer, load the model, fine-tune it on a Shayari dataset, and generate Shayari. We also discussed the mathematics behind the model, including the self-attention mechanism and the cross-entropy loss.

Remember, while AI has made significant strides in text generation, it’s still not perfect. Always review and edit the generated text as necessary. Happy coding, and enjoy your AI-generated Shayari!

Future Work

  • Collect more data.
  • Fine-tune a Llama 2 model.
  • Add audio generation for selected shayars (poets), such as Jaun Elia.
