Data Collection & Curation for LLM Training
A significant part of my work involves building pipelines to collect and curate high-quality training data for large language models. This includes:
- Educational Content Extraction – Developing systems to collect French educational materials including K-12 mathematics exercises and multiple-choice questions from sources like Sesamath, Exo7, and Mathalea
- Academic Document Processing – Processing 200,000+ scientific papers and PhD theses from the HAL open science archive, producing approximately 7.7 billion tokens of French academic text
- LaTeX Parsing – Extracting and preserving mathematical content from LaTeX sources, handling complex notation and formula structures
- Ethical Web Scraping – Implementing crawlers that respect robots.txt rules and rate limits
- Dataset Publishing – Preparing and releasing datasets on HuggingFace Hub for the research community
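The robots.txt and rate-limit handling described above can be sketched with the Python standard library alone. This is a minimal illustration, not the production crawler; the class and parameter names are invented for the example:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    """Check robots.txt before fetching and enforce a per-host delay."""

    def __init__(self, user_agent: str = "research-crawler", delay: float = 2.0):
        self.user_agent = user_agent
        self.delay = delay  # minimum seconds between requests to one host
        self._parsers: dict[str, urllib.robotparser.RobotFileParser] = {}
        self._last_hit: dict[str, float] = {}

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt permits this user agent to fetch url."""
        host = urlparse(url).netloc
        if host not in self._parsers:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            rp.read()  # network call; cache the parsed result per host
            self._parsers[host] = rp
        return self._parsers[host].can_fetch(self.user_agent, url)

    def wait_turn(self, url: str) -> None:
        """Sleep just long enough to respect the per-host rate limit."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self._last_hit.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last_hit[host] = time.monotonic()
```

In practice the per-host delay would also honor any `Crawl-delay` directive the site declares.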
Document Processing & Format Conversion
Converting documents between formats while preserving structure and mathematical notation is a core challenge in preparing LLM training data:
- PDF-to-Markdown Conversion – Using tools like Docling to convert academic PDFs to markdown while maintaining document structure, tables, and code blocks
- Formula Handling – Validating and correcting LaTeX formulas using KaTeX, tracking error rates, and filtering malformed expressions
- Vision-based OCR – Working with multimodal models like DeepSeek-OCR for document understanding and text extraction from images
- Multi-stage Pipelines – Building processing chains that filter, transform, and clean content through HTML parsing, format conversion, and content reinsertion
- Text Encoding Fixes – Correcting character encoding issues common in French text processing (accents, punctuation)
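One frequent fault in French corpora is mojibake from UTF-8 text mis-decoded as Latin-1, where "é" surfaces as "Ã©". A minimal repair for that specific corruption pattern (a sketch, not the full cleaning pass) reverses the decode:

```python
def fix_mojibake(text: str) -> str:
    """Reverse the common 'UTF-8 bytes decoded as Latin-1' corruption.

    If the text was already clean, re-encoding it as Latin-1 and decoding
    as UTF-8 typically raises a UnicodeError, in which case the input is
    returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text

print(fix_mojibake("ProblÃ¨me rÃ©solu"))  # -> "Problème résolu"
```

A real pipeline would apply such repairs more cautiously, since a few clean strings can also round-trip through this transformation.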
LLM Evaluation
Systematic evaluation is essential for understanding model capabilities. My work in this area includes:
- Evaluation Frameworks – Building custom evaluation infrastructure using the LM Evaluation Harness on SLURM clusters
- Benchmark Coverage – Running 131+ task configurations including TruthfulQA, GSM8K, MMLU variants, IFEval, and reading comprehension tasks
- Multilingual Evaluation – Testing models across 25+ languages including Arabic, African languages (Swahili, Yoruba, Igbo), and European languages (French, German, Catalan, Basque)
- Distributed Evaluation – Managing SLURM job submission and multi-GPU evaluation across multiple models simultaneously
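The job-generation side of such a setup can be sketched as rendering one sbatch script per model; the SLURM directives, paths, and `lm_eval` flags shown here are illustrative and vary by cluster and harness version:

```python
from pathlib import Path

# Hypothetical template; adapt partition, time limits, and flags to the cluster.
SBATCH_TEMPLATE = """\
#!/bin/bash
#SBATCH --job-name=eval-{name}
#SBATCH --gres=gpu:{gpus}
#SBATCH --time=12:00:00

lm_eval --model hf \\
    --model_args pretrained={model},dtype=bfloat16 \\
    --tasks {tasks} \\
    --batch_size auto \\
    --output_path results/{name}
"""

def make_eval_script(model: str, tasks: list[str], gpus: int = 1) -> str:
    """Render an sbatch script evaluating one model on a list of tasks."""
    name = model.split("/")[-1]
    return SBATCH_TEMPLATE.format(
        name=name, model=model, gpus=gpus, tasks=",".join(tasks)
    )

def submit_all(models: list[str], tasks: list[str], outdir: str = "jobs") -> None:
    """Write one script per model; submission itself happens on the cluster."""
    Path(outdir).mkdir(exist_ok=True)
    for m in models:
        script = Path(outdir) / f"{m.split('/')[-1]}.sbatch"
        script.write_text(make_eval_script(m, tasks))
        # subprocess.run(["sbatch", str(script)], check=True)  # on the cluster
```

Keeping the rendered scripts on disk makes failed runs easy to resubmit or tweak individually.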
Model Fine-tuning
I have worked with advanced fine-tuning techniques to improve language model performance:
- Self-Play Fine-Tuning (SPIN) – Implementing iterative self-improvement methods where models learn from their own generated outputs without additional human annotation
- Preference Learning – Training models to distinguish between high- and low-quality responses using preference-based objectives
- Distributed Training – Using DeepSpeed ZeRO-3 and Accelerate for efficient multi-GPU training on large models
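The core of a preference-based objective can be illustrated with a DPO-style loss for a single response pair. This is pure Python for readability; actual training computes the same quantity over batched log-probabilities from the policy and a frozen reference model:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO-style preference loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each response under the policy
    (pi_*) and the frozen reference model (ref_*). The loss shrinks as the
    policy widens the chosen-vs-rejected margin relative to the reference.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is log 2; improving the margin on the chosen response drives it toward zero.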
Current Work at CEA
As an NLP Engineer at the French Alternative Energies and Atomic Energy Commission (CEA), I apply computational linguistics to real-world challenges. My work involves developing and evaluating language technologies that support the organization's research mission.
The role combines hands-on engineering with research-oriented thinking, requiring both practical implementation skills and a solid theoretical foundation in linguistics and machine learning.