This repository centralizes our research on skill extraction and classification from job offers and resumes in a multilingual, multidomain setting. The work was carried out through a multidisciplinary, industry-grounded collaboration among HR researchers, NLP researchers and HR-tech professionals. It covers six languages (English, French, German, Italian, Spanish, and Portuguese) for both hard-skill extraction and soft-skill classification.
By Laura Vásquez-Rodríguez, Bertrand Audrin, Samuel Michel, Samuele Galli, Julneth Rogenhofer, Jacopo Negro Cusa and Lonneke van der Plas.
This README acts as a single entry point to the publications and to the public datasets we use to keep our benchmarks reproducible (see Data Availability).
📄 Paper · "Skill Extraction from Resumes and Job Offers across Six Languages", accepted at the 11th Swiss Text Analytics Conference (SwissText), 2026.
TL;DR — We comprehensively evaluate rule-based, semantic, and supervised skill-extraction methods on 1,200 annotated job offers and resumes across six languages and diverse domains. Supervised models reach F1 scores up to 0.6, while rule-based methods offer better interpretability. We find that skills are formulated very differently in resumes vs. job offers, with resumes notably understudied in prior academic work.
📄 Paper · "Soft Skills in the Wild: Challenges in Multilingual Classification", accepted at the 10th Swiss Text Analytics Conference (SwissText), 2025.
TL;DR — Using a multilingual BERT-based classifier for soft skills, we compare soft skills, hard skills, and occupations along surface-level and semantic properties. Even when constrained to established taxonomy categories, soft skills show far greater variability in their textual expression than other entity types, a key challenge for recruitment algorithms.
Due to the proprietary nature of the job offers and resumes provided by our industrial partner, the release of the primary dataset is restricted. To ensure reproducibility while respecting these constraints, we adopt two strategies:
- Experiments on public data. Alongside the proprietary benchmarks, we report results on publicly available job offers and taxonomies (see Public Datasets below).
- Methodological transparency. Our papers document the methodology in full so the benchmarks remain as reproducible as possible, providing a transparent framework for our real-world skill-extraction approach.
| Dataset | Language(s) | Skill type | Link |
|---|---|---|---|
| Green | English | Hard / ICT skills in job ads | https://huggingface.co/datasets/jjzha/green |
| ESCO taxonomy | Multilingual (EU) | Skill & occupation taxonomy | https://esco.ec.europa.eu/en/use-esco/download |
| Fijo | French | Soft skills (insurance job ads) | https://huggingface.co/datasets/jjzha/fijo |
| Sayfullina | English | Soft skills | https://huggingface.co/datasets/jjzha/sayfullina |
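
As a convenience, the three Hugging Face datasets in the table can be fetched with the `datasets` library. The snippet below is a minimal sketch rather than part of our pipeline: the helper names `repo_id` and `load_public_dataset` are hypothetical, the repository IDs are read off the links above, and split and column names are deliberately not assumed. The ESCO taxonomy is distributed through the ESCO portal and is not covered here.

```python
# Repository IDs taken from the Hugging Face links in the table above.
PUBLIC_DATASETS = {
    "green": "jjzha/green",
    "fijo": "jjzha/fijo",
    "sayfullina": "jjzha/sayfullina",
}


def repo_id(name: str) -> str:
    """Map a short dataset name to its Hub repository ID (hypothetical helper)."""
    key = name.strip().lower()
    if key not in PUBLIC_DATASETS:
        raise KeyError(
            f"Unknown dataset {name!r}; choose from {sorted(PUBLIC_DATASETS)}"
        )
    return PUBLIC_DATASETS[key]


def load_public_dataset(name: str):
    """Download a dataset from the Hub (hypothetical helper; needs network access)."""
    # Imported lazily so the mapping above works without the dependency
    # installed (`pip install datasets`).
    from datasets import load_dataset

    return load_dataset(repo_id(name))
```

Calling `load_public_dataset("green")` should return the dataset object for `jjzha/green`; printing it is a quick way to inspect the available splits and columns before relying on any of them.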
If you use this work, please cite the relevant paper(s):
@inproceedings{vasquez-rodriguez-etal-2026-skill,
title = "Skill Extraction from Resumes and Job Offers across Six Languages",
author = "V{\'a}squez-Rodr{\'i}guez, Laura and
Audrin, Bertrand and
Michel, Samuel and
Galli, Samuele and
Rogenhofer, Julneth and
Cusa, Jacopo Negro and
van der Plas, Lonneke",
booktitle = "Proceedings of the 11th edition of the Swiss Text Analytics Conference",
year = "2026"
}

@inproceedings{vasquez-rodriguez-etal-2025-soft,
title = "Soft Skills in the Wild: Challenges in Multilingual Classification",
author = "V{\'a}squez-Rodr{\'i}guez, Laura and
Audrin, Bertrand and
Michel, Samuel and
Galli, Samuele and
Rogenhofer, Julneth and
Cusa, Jacopo Negro and
van der Plas, Lonneke",
editor = {Gerber, Jonathan and
Cieliebak, Mark and
Tuggener, Don and
H{\"u}rlimann, Manuela},
booktitle = "Proceedings of the 10th edition of the Swiss Text Analytics Conference",
month = may,
year = "2025",
address = "Winterthur, Switzerland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.swisstext-1.11/",
pages = "108--114"
}

This research was conducted at Idiap Research Institute as part of the SEM24 project, funded by Innosuisse, the Swiss Innovation Agency, and carried out in collaboration with our industrial partner Arca24, whose job offers and resumes made this multilingual, real-world study possible.