AI vs Human — Detect LLM-Generated Text
First article of 2024! Let's detect LLM-generated text!
In this article, I will walk you through a step-by-step guide to building a deep learning model that can detect whether a text was generated by an LLM or written by a human. I'll point you to a few HuggingFace datasets, and to improve our performance, we'll fine-tune BERT. Can you tell whether this text was generated by an LLM or not?
What is BERT?
BERT stands for Bidirectional Encoder Representations from Transformers and is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
Source: https://arxiv.org/pdf/1810.04805.pdf
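To make the "one additional output layer" idea concrete, here is a minimal sketch of what that looks like with the HuggingFace transformers library: a pre-trained BERT encoder topped with a fresh two-class classification head. The bert-base-uncased checkpoint and the label convention are assumptions for illustration, not necessarily the exact setup we use later.

```python
# Minimal sketch: pre-trained BERT + a new 2-class classification head.
# Assumptions: the bert-base-uncased checkpoint, label 0 = human, 1 = LLM.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # 0 = human-written, 1 = LLM-generated
)

# Tokenize a sample text and look at the (still untrained) class logits.
inputs = tokenizer("Can you tell who wrote this?", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```

Only the classification head is randomly initialized; everything else comes from the pre-trained checkpoint, which is why fine-tuning works with relatively little labeled data.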
Load the datasets
Before we dig deeper into BERT, let's start by loading the datasets. We are going to work with two datasets from HuggingFace. One of them is the IvyPanda essays dataset, which will provide us with human-written text. The other dataset is the…
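The exact dataset names follow, but as a preview, pulling data from the HuggingFace Hub typically looks like the sketch below; the repository identifiers here are placeholders, not the real ones used in this project.

```python
# Minimal sketch of loading datasets from the HuggingFace Hub with `datasets`.
# The repository names are placeholders; substitute the actual dataset IDs.
from datasets import load_dataset

human_ds = load_dataset("some-org/human-essays", split="train")      # hypothetical ID
llm_ds = load_dataset("some-org/llm-generated-text", split="train")  # hypothetical ID

print(human_ds)   # features and number of rows
print(llm_ds[0])  # first example as a dict
```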