Data Collection & Curation for LLM Training
A significant part of my work involves building pipelines to collect and curate high-quality training data for large language models. This includes:
- Educational Content Extraction – Developing systems to collect French educational materials including K-12 mathematics exercises and multiple-choice questions from sources like Sesamath, Exo7, and Mathalea
- Academic Document Processing – Processing 200,000+ scientific papers and PhD theses from the HAL open science archive, producing approximately 7.7 billion tokens of French academic text
- LaTeX Parsing – Extracting and preserving mathematical content from LaTeX sources, handling complex notation and formula structures
- Ethical Web Scraping – Implementing crawlers that respect robots.txt rules and rate limits
- Dataset Publishing – Preparing and releasing datasets on HuggingFace Hub for the research community
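The robots.txt and rate-limit handling described above can be sketched with the Python standard library alone. This is a minimal illustration, not the production crawler; the class and parameter names are invented for the example:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    """Check robots.txt before fetching and enforce a per-host delay."""

    def __init__(self, user_agent: str = "research-crawler", delay: float = 2.0):
        self.user_agent = user_agent
        self.delay = delay  # minimum seconds between requests to one host
        self._parsers: dict[str, urllib.robotparser.RobotFileParser] = {}
        self._last_hit: dict[str, float] = {}

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt permits this user agent to fetch url."""
        host = urlparse(url).netloc
        if host not in self._parsers:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            rp.read()  # network call; cache the parsed result per host
            self._parsers[host] = rp
        return self._parsers[host].can_fetch(self.user_agent, url)

    def wait_turn(self, url: str) -> None:
        """Sleep just long enough to respect the per-host rate limit."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self._last_hit.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_hit[host] = time.monotonic()
```

In practice the per-host delay would also honor any `Crawl-delay` directive the site declares.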
Document Processing & Format Conversion
Converting documents between formats while preserving structure and mathematical notation is a core challenge in preparing LLM training data:
- PDF-to-Markdown Conversion – Using tools like Docling to convert academic PDFs to markdown while maintaining document structure, tables, and code blocks
- Formula Handling – Validating and correcting LaTeX formulas using KaTeX, tracking error rates, and filtering malformed expressions
- Vision-based OCR – Working with multimodal models like DeepSeek-OCR for document understanding and text extraction from images
- Multi-stage Pipelines – Building processing chains that filter, transform, and clean content through HTML parsing, format conversion, and content reinsertion
- Text Encoding Fixes – Correcting character encoding issues common in French text processing (accents, punctuation)
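One frequent fault in French corpora is mojibake from UTF-8 text mis-decoded as Latin-1, where "é" surfaces as "Ã©". A minimal repair for that specific corruption pattern (a sketch, not the full cleaning pass) reverses the decode:

```python
def fix_mojibake(text: str) -> str:
    """Reverse the common 'UTF-8 bytes decoded as Latin-1' corruption.

    If the text was already clean, re-encoding it as Latin-1 and decoding
    as UTF-8 typically raises a UnicodeError, in which case the input is
    returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text

print(fix_mojibake("ProblÃ¨me rÃ©solu"))  # -> "Problème résolu"
```

A real pipeline would apply such repairs more cautiously, since a few clean strings can also round-trip through this transformation.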
LLM Evaluation
Systematic evaluation is essential for understanding model capabilities. My work in this area includes:
- Evaluation Frameworks – Building custom evaluation infrastructure using the LM Evaluation Harness on SLURM clusters
- Benchmark Coverage – Running 131+ task configurations including TruthfulQA, GSM8K, MMLU variants, IFEval, and reading comprehension tasks
- Multilingual Evaluation – Testing models across 25+ languages including Arabic, African languages (Swahili, Yoruba, Igbo), and European languages (French, German, Catalan, Basque)
- Distributed Evaluation – Managing SLURM job submission and multi-GPU evaluation across multiple models simultaneously
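The job-generation side of such a setup can be sketched as rendering one sbatch script per model; the SLURM directives, paths, and `lm_eval` flags shown here are illustrative and vary by cluster and harness version:

```python
from pathlib import Path

# Hypothetical template; adapt partition, time limits, and flags to the cluster.
SBATCH_TEMPLATE = """\
#!/bin/bash
#SBATCH --job-name=eval-{name}
#SBATCH --gres=gpu:{gpus}
#SBATCH --time=12:00:00

lm_eval --model hf \\
    --model_args pretrained={model},dtype=bfloat16 \\
    --tasks {tasks} \\
    --batch_size auto \\
    --output_path results/{name}
"""

def make_eval_script(model: str, tasks: list[str], gpus: int = 1) -> str:
    """Render an sbatch script evaluating one model on a list of tasks."""
    name = model.split("/")[-1]
    return SBATCH_TEMPLATE.format(
        name=name, model=model, gpus=gpus, tasks=",".join(tasks)
    )

def submit_all(models: list[str], tasks: list[str], outdir: str = "jobs") -> None:
    """Write one script per model; submission itself happens on the cluster."""
    Path(outdir).mkdir(exist_ok=True)
    for m in models:
        script = Path(outdir) / f"{m.split('/')[-1]}.sbatch"
        script.write_text(make_eval_script(m, tasks))
        # subprocess.run(["sbatch", str(script)], check=True)  # on the cluster
```

Keeping the rendered scripts on disk makes failed runs easy to resubmit or tweak individually.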
Model Fine-tuning
I have worked with advanced fine-tuning techniques to improve language model performance:
- Self-Play Fine-Tuning (SPIN) – Implementing iterative self-improvement methods where models learn from their own generated outputs without additional human annotation
- Preference Learning – Training models to distinguish between high- and low-quality responses using preference-based objectives
- Distributed Training – Using DeepSpeed ZeRO-3 and Accelerate for efficient multi-GPU training on large models
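The core of a preference-based objective can be illustrated with a DPO-style loss for a single response pair. This is pure Python for readability; actual training computes the same quantity over batched log-probabilities from the policy and a frozen reference model:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO-style preference loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each response under the policy
    (pi_*) and the frozen reference model (ref_*). The loss shrinks as the
    policy widens the chosen-vs-rejected margin relative to the reference.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is log 2; improving the margin on the chosen response drives it toward zero.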
Current Work at CEA
As an NLP Engineer at the French Alternative Energies and Atomic Energy Commission (CEA), I apply computational linguistics to real-world challenges. My work involves developing and evaluating language technologies that support the organization's research mission.
The role combines hands-on engineering with research-oriented thinking, requiring both practical implementation skills and a solid theoretical foundation in linguistics and machine learning.