In an era where artificial intelligence is reshaping every domain from communication to content creation, distinguishing between human-written and machine-generated text has become crucial for preserving the integrity of digital platforms. This report evaluates Long Short-Term Memory (LSTM) networks and RoBERTa for the task of detecting machine-generated English text. The study shows that RoBERTa consistently outperforms the LSTM models, achieving higher accuracy and macro-F1 scores, particularly as the amount of training data increases. RoBERTa's self-attention mechanism allows it to capture complex language patterns and long-range dependencies, avoids the information bottleneck of a single recurrent state, and processes tokens in parallel, giving it the upper hand over the LSTM, which struggles to model long-term dependencies in text. Across the training-data regimes examined for this sequence classification task, the findings establish RoBERTa as the more effective model for addressing the challenges of AI-generated text detection.
This project’s main goal is to build a detection system that can distinguish between English texts written by humans and those generated by machines, and to compare the performance of two machine learning models, RoBERTa and LSTM, on this task. Additionally, we analyze how the LSTM models respond to different data sizes and changes in hyperparameters. As the capability to produce machine-generated text grows, such a detection system offers a trustworthy way to identify this content, which is important for applications such as academic integrity, content verification, and disinformation prevention.
•The experiments were conducted using the official English dataset provided by COLING-2025-Workshop-on-MGT-Detection-Task1.
•The dataset has two labels: Human (0) and Machine-generated (1); each sample pairs a text with its label.
•The distribution of the dataset is shown in Figure 1 and Figure 2.
•It can be observed that the training set contains a total of 610,767 samples, comprising 228,922 human-written texts and 381,845 machine-generated texts. The training set is imbalanced, and the class proportions differ from those in the development set.
•During training, a subset of the provided training set was used, and the entire training set was also used in a later evaluation. However, no attempt was made to balance the dataset; as discussed in the limitations section, this is not best practice.
•The data was preprocessed (lowercasing all text and removing special characters) to reduce the vocabulary size.
•Unknown-word (<unk>) tokens were used for out-of-vocabulary words, and padding (<pad>) tokens were introduced to pad sequences to a uniform length (512), ensuring compatibility with batch training.
•Text data was tokenized at the word level and transformed into sequences by mapping tokens to indices, followed by truncation or padding, as sketched below.
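A minimal sketch of this preprocessing and tokenization pipeline is given below; the function names, the 50,000-word vocabulary cap, and the exact regular expression are illustrative assumptions rather than the project's actual code.

```python
import re
from collections import Counter

MAX_LEN = 512  # fixed sequence length used for batching

def clean(text: str) -> str:
    """Lowercase and strip special characters to shrink the vocabulary."""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", " ", text)

def build_vocab(texts, max_size=50_000):
    """Map the most frequent words to indices; 0 = <pad>, 1 = <unk>."""
    counts = Counter(w for t in texts for w in clean(t).split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(max_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Tokenize at the word level, map to indices, then truncate/pad to MAX_LEN."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in clean(text).split()][:MAX_LEN]
    return ids + [vocab["<pad>"]] * (MAX_LEN - len(ids))
```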
•The LSTM classifier was evaluated with both unidirectional and bidirectional architectures:
•Unidirectional: the input sequence is fed into the RNN in left-to-right order. The hidden state of the last time step from the final LSTM layer is used as input to the classification head.
•Bidirectional (model structure shown in Figure 3): processes the input sequence in both forward and backward directions. The hidden states from the last forward and backward steps are concatenated, resulting in a representation twice the size of the LSTM hidden vector.
•Both architectures pass the extracted hidden state(s) through a feed-forward neural network (FFNN) with a sigmoid output activation for binary classification, as sketched below.
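The architecture described above corresponds roughly to the following PyTorch module; the class name, the ReLU hidden layer, and the exact head sizes are assumptions consistent with the two classification layers and sigmoid output listed in the training setup.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_size=300, hidden_size=128,
                 num_layers=2, dropout=0.5, bidirectional=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout,
                            bidirectional=bidirectional)
        feat = hidden_size * (2 if bidirectional else 1)
        # Two-layer classification head with a sigmoid output for binary labels.
        self.head = nn.Sequential(
            nn.Linear(feat, feat // 2), nn.ReLU(),
            nn.Linear(feat // 2, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        emb = self.embedding(x)               # (batch, seq, emb)
        _, (h_n, _) = self.lstm(emb)          # h_n: (layers * directions, batch, hidden)
        if self.lstm.bidirectional:
            # Concatenate the last forward and backward hidden states of the final layer.
            feat = torch.cat([h_n[-2], h_n[-1]], dim=1)
        else:
            feat = h_n[-1]
        return self.head(feat).squeeze(-1)    # probability of the "machine" class
```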
•The RoBERTa classifier was included as a baseline model:
–The roberta-base pretrained model was loaded and globally fine-tuned (all parameters updated) on the training set; a sketch of this setup follows below.
–RoBERTa was chosen as the baseline instead of BERT because we expected it to perform better on this task, primarily due to the differences between RoBERTa and BERT outlined in Table 1.
–Early stopping (monitored on a subset of the development set) was enabled to prevent overfitting during training.
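A hedged sketch of this baseline using the Hugging Face transformers Trainer; `train_ds` and `dev_subset` stand for tokenized dataset objects, and the batch size, learning rate, epoch count, and patience are placeholder values rather than the reported configuration (argument names such as `evaluation_strategy` may also differ across library versions).

```python
from transformers import (AutoTokenizer, RobertaForSequenceClassification,
                          Trainer, TrainingArguments, EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

args = TrainingArguments(
    output_dir="roberta-mgt-detection",
    per_device_train_batch_size=32,      # assumed value
    learning_rate=2e-5,                  # assumed value
    num_train_epochs=3,                  # assumed value
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,              # tokenized training subset (assumed name)
    eval_dataset=dev_subset,             # subset of the development set (assumed name)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```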
•The models were implemented using PyTorch’s nn.LSTM module.
•Bidirectionality was enabled by setting bidirectional=True.
•Experiments included hyperparameter tuning to analyze the effect of embedding size, dropout rate, and hidden vector size on model performance.
•The LSTM models were trained on the entire training set and evaluated on the development test set. For hyperparameter tuning, a subset of 100,000 samples from the training set was used due to computational limitations.
–Input sequence length: 512.
–Embedding size: 300 (for base experiments) with variations (200, 400, 600) tested for hyperparameter tuning.
–Hidden vector size: 128 (for base experiments) with variations (256, 512) tested for hyperparameter tuning.
–Number of LSTM layers: 2.
–Dropout rate: 0.5 for the base experiments (rates of 0.001, 0.01, and 0.5 were compared during tuning).
–Number of classification layers: 2.
–Cost function: Binary Cross Entropy Loss.
–Model Parameter Initialization: shown in Table 2 (which may not be the optimal choice, as discussed in the limitations section).
–Optimizer: Adam optimizer with a learning rate of 0.001.
–Batch size: 64.
–Training epochs: 10.
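Put together, the configuration above corresponds roughly to the following training loop; `LSTMClassifier` refers to the module sketched earlier, and `vocab` and `train_loader` (batch size 64) are assumed to come from the preprocessing step.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = LSTMClassifier(vocab_size=len(vocab), emb_size=300,
                       hidden_size=128, num_layers=2,
                       dropout=0.5, bidirectional=True).to(device)

criterion = nn.BCELoss()                                   # binary cross entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, learning rate 0.001

for epoch in range(10):                                    # 10 training epochs
    model.train()
    for x, y in train_loader:                              # batches of 64 padded sequences
        x, y = x.to(device), y.float().to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```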
•The performance of the models was evaluated on the test set using Macro-F1 Score and Accuracy as metrics.
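Both metrics can be computed with scikit-learn; here `probs` and `labels` are assumed to be NumPy arrays of predicted probabilities and gold labels, with a 0.5 decision threshold.

```python
from sklearn.metrics import accuracy_score, f1_score

preds = (probs >= 0.5).astype(int)                   # threshold sigmoid outputs
accuracy = accuracy_score(labels, preds)
macro_f1 = f1_score(labels, preds, average="macro")  # unweighted mean of per-class F1
```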
•Table 3 shows the performance of the baseline RoBERTa classifier, trained on a 1/10 subset of the training set (61,076 samples) and evaluated on the entire development test set.
•Figure 5 shows the performance of the LSTM classifier with different training set sizes.
•Table 4 shows the effect of embedding size on the performance of the LSTM classifier.
•Table 5 shows the effect of hidden vector size on the performance of the LSTM classifier.
•Figure 6 shows the training loss after 10 epochs for different dropout rates, while Figure 7 shows the effect of dropout rate on the performance of the LSTM classifier.
–The performance of the LSTM classifier improves as the dataset size increases from 1,000 to 100,000.
–The Bidirectional LSTM achieves slightly better scores on larger datasets due to its ability to handle complex dependencies and capture contextual information from both directions.
–Varying the embedding size from 200 to 600 shows inconsistent performance variations across the Uni-Directional and Bi-Directional architectures.
–Uni-Directional LSTM achieves its best performance with an embedding size of 200, attaining a Macro-F1 score of 0.64005 and an accuracy of 0.65851. This suggests that reducing the embedding size can enhance performance for Uni-Directional LSTMs, possibly due to better generalization with fewer parameters when the training data is limited.
–Bi-Directional LSTM achieves its best performance with an embedding size of 300, obtaining a Macro-F1 score of 0.64571 and accuracy of 0.66020. This indicates that Bi-Directional LSTMs benefit more from moderate embedding dimensions, as larger embedding sizes like 600 do not result in better performance and require increased computational resources.
–Interestingly, both architectures perform worse with an embedding size of 600, reinforcing the notion that excessively large embeddings may lead to overfitting or inefficiencies.
Effect of Hidden Vector Sizes:
- Increasing the hidden vector size improves the performance of both models in terms of Macro-F1, Accuracy, and Loss (see Table 5).
- Uni-Directional LSTM: shows a slight increase in performance as the hidden vector size grows.
- Bi-Directional LSTM: the optimal hidden vector size is 256, achieving the highest Macro-F1 (0.6391) and Accuracy (0.6559) scores.
Effect of Dropout Rate (Figures 6 and 7):
- Highlights:
- The Bi-Directional LSTM consistently outperforms the Uni-Directional LSTM across all dropout rates, especially at the higher regularization levels (0.5 and 0.01).
- The Uni-Directional LSTM performs better at the lowest dropout rate (0.001), suggesting it benefits from less regularization.
- A 0.5 dropout rate is an effective regularization level for the Bi-Directional LSTM.
- A 0.01 dropout rate strikes a good balance for both models.
•Freezing Parameters: The current approach globally fine-tunes all parameters, which is more computationally expensive than freezing the pretrained weights and training only a classification head, and may not yield a significant performance improvement (see the freezing sketch after this list).
•Dataset Imbalance: In the training set, the ratio of human-written to machine-generated text is approximately 1:1.67. This imbalance biases the trained model toward predicting text as machine-generated. A more balanced sampling approach could help mitigate this systematic issue (see the sampling sketch after this list).
•Sequence Truncation: Similar to RoBERTa, truncating sequences to 512 tokens limited the LSTM's ability to capture long-range dependencies, further reducing its performance compared to models with contextual embeddings.
•Last-Hidden-State Classification: Only the last hidden state of the LSTM is used for classification. This approach may overlook important information from earlier timesteps, particularly in longer sequences. Alternative strategies, such as mean or max pooling across all hidden states, could provide richer sequence-level representations (see the pooling sketch after this list).
•Word-Level Tokenization: The LSTM implementation relies on word-level tokenization, which results in significant out-of-vocabulary (OOV) issues. Words not present in the vocabulary are mapped to a single unknown-word token, leading to a loss of semantic information. Switching to subword tokenization techniques such as Byte Pair Encoding (BPE) could mitigate this limitation (see the tokenizer sketch after this list).
•Parameter Initialization: The current implementation uses PyTorch's default parameter initialization for the LSTM, including setting the forget gate bias to 0. In practice, setting it to 1 is often more effective at alleviating vanishing gradients and accelerating training (see the bias-initialization sketch after this list).
•Masking: The current training setup for the LSTM lacks advanced techniques such as masking, which could handle variable-length sequences more effectively and improve gradient flow during training (see the packed-sequence sketch after this list).
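For the freezing limitation: a minimal sketch of keeping the roberta-base encoder frozen and training only the classification head (an alternative to the global fine-tuning used in this project, not what was actually done).

```python
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Freeze the pretrained encoder; only the classification head remains trainable.
for param in model.roberta.parameters():
    param.requires_grad = False

print([name for name, p in model.named_parameters() if p.requires_grad])
# Expected to list only the classifier head parameters (classifier.dense.*, classifier.out_proj.*).
```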
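For the dataset imbalance: one possible balanced-sampling setup with a WeightedRandomSampler; `labels` and `train_dataset` are assumed objects, not part of the original implementation.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# labels: 1-D tensor of 0 (human) / 1 (machine) for every training sample
class_counts = torch.bincount(labels)                # roughly [228922, 381845]
sample_weights = 1.0 / class_counts[labels].float()  # rarer class gets a larger weight

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```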
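For the last-hidden-state limitation: a sketch of mean pooling over all LSTM outputs, masking padded positions (assumes padding index 0, as in the preprocessing sketch).

```python
import torch

def masked_mean_pool(lstm_out, token_ids, pad_idx=0):
    """Average LSTM outputs over real tokens only.

    lstm_out:  (batch, seq, hidden * num_directions) from nn.LSTM
    token_ids: (batch, seq) input indices, used to locate padding
    """
    mask = (token_ids != pad_idx).unsqueeze(-1).float()  # (batch, seq, 1)
    summed = (lstm_out * mask).sum(dim=1)                # ignore padded steps
    lengths = mask.sum(dim=1).clamp(min=1.0)             # avoid division by zero
    return summed / lengths                              # (batch, hidden * directions)
```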
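For the OOV limitation: subword tokenization could, for example, reuse RoBERTa's byte-level BPE tokenizer so that unseen words are split into known subword units rather than collapsed into a single unknown token.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # byte-level BPE vocabulary

enc = tokenizer("xylophonists improvising", truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Rare words are decomposed into subwords instead of mapping to an unknown token.
```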
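For the initialization limitation: PyTorch stores the LSTM gate biases in the order input, forget, cell, output, so the forget-gate slice is the second quarter of each bias vector and can be set to 1 after construction.

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=300, hidden_size=128, num_layers=2,
               batch_first=True, bidirectional=True)

for name, param in lstm.named_parameters():
    if "bias" in name:                        # covers bias_ih_l* and bias_hh_l*
        hidden = param.size(0) // 4           # gate layout: [input | forget | cell | output]
        param.data[hidden:2 * hidden].fill_(1.0)
```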
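For the masking limitation: one standard way to skip padded timesteps is PyTorch's packed-sequence utilities; `emb`, `lengths`, and `lstm` are assumed to come from the classifier sketched earlier.

```python
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# emb: (batch, seq, emb_size) padded embeddings; lengths: number of real tokens per sample
packed = pack_padded_sequence(emb, lengths.cpu(), batch_first=True,
                              enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)          # the recurrence skips padded timesteps
lstm_out, _ = pad_packed_sequence(packed_out, batch_first=True)
```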
This project focused on the critical task of distinguishing between human-written and machine-generated English text using machine learning models, specifically LSTM classifiers and a RoBERTa baseline. The study revealed several key findings:
•The RoBERTa classifier consistently outperformed the LSTM models, achieving higher Macro-F1 scores. This highlights the effectiveness of pre-trained transformer-based models in handling complex text classification tasks.
•Uni-Directional and Bi-Directional LSTMs exhibited varying performance based on hyperparameters such as embedding size, hidden vector size, and dropout rate. Notably, the Bi-Directional LSTM generally achieved better results than the Uni-Directional LSTM due to its ability to capture bidirectional contextual information.
•The embedding size and hidden vector size played significant roles in determining model performance. While a smaller embedding size (200) worked best for the Uni-Directional LSTM, the Bi-Directional LSTM benefited from a moderate embedding size (300).
•Dropout rates impacted the regularization and overall performance of the models. A dropout rate of 0.5 provided optimal results for Bi-Directional LSTMs, while Uni-Directional LSTMs preferred lower regularization.
Despite these findings, the project faced limitations such as dataset imbalance, sequence truncation, and out-of-vocabulary issues, which likely constrained the performance of the LSTM models. Future work can address these limitations by exploring advanced tokenization techniques, balanced sampling strategies, and alternative architectures or ensemble methods.
In conclusion, while both models demonstrated their strengths, RoBERTa proved to be a more robust and scalable solution for detecting machine-generated text. This work underscores the importance of fine-tuning model parameters and leveraging pre-trained language models to address the challenges of AI-generated text detection effectively.






