CoFiF is the first large corpus of company reports in the French language. It contains over 188 million tokens in 2655 reports, covering four types of documents.
These documents were collected from the 60 largest French companies listed in France’s main stock indices, CAC40 and CAC Next 20. The corpus spans more than 20 years, from 1995 to 2018.
CoFiF can be downloaded at https://github.com/CoFiF/Corpus. The PDF files of the corpus can be found here. In addition to the PDF files, which were collected from enterprises (all rights reserved), we provide the reports as raw text without further pre-processing. We also provide a cleaned dataset, CoFiF_cleaned_all.txt, which was used to train the language model reported in the paper.
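As a quick sanity check after downloading, the cleaned file can be streamed line by line to get rough token statistics. The sketch below is only an illustration: it assumes CoFiF_cleaned_all.txt sits in the current directory and is UTF-8 encoded, and it splits on whitespace, which is not necessarily the tokenization used in the paper, so the counts may differ from the figures reported above. The function name token_stats is hypothetical.

```python
from collections import Counter
from pathlib import Path


def token_stats(path, encoding="utf-8"):
    """Stream a text file; return (total token count, Counter of lowercased tokens).

    Tokens are whitespace-separated strings. This is an assumption made for
    illustration, not the tokenization described in the CoFiF paper.
    """
    counts = Counter()
    total = 0
    # Read line by line so the whole corpus never has to fit in memory.
    with Path(path).open(encoding=encoding, errors="replace") as handle:
        for line in handle:
            tokens = line.split()
            total += len(tokens)
            counts.update(token.lower() for token in tokens)
    return total, counts


if __name__ == "__main__":
    corpus = Path("CoFiF_cleaned_all.txt")  # assumed local path to the cleaned dataset
    if corpus.exists():
        total, counts = token_stats(corpus)
        print(f"{total} whitespace-separated tokens")
        print(counts.most_common(10))
    else:
        print(f"{corpus} not found; download the corpus first")
```

Streaming keeps memory use bounded by the vocabulary size rather than the corpus size, which matters for a file of this scale.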
If you’re using CoFiF in your research, please don’t forget to cite this paper:
@inproceedings{daudert-ahmadi-2019-cofif,
title = "{C}o{F}i{F}: A Corpus of Financial Reports in {F}rench Language",
author = "Daudert, Tobias and
Ahmadi, Sina",
booktitle = "Proceedings of the First Workshop on Financial Technology and Natural Language Processing",
month = "12 " # aug,
year = "2019",
address = "Macao, China",
url = "https://www.aclweb.org/anthology/W19-5504",
pages = "21--26",
}
This corpus is openly available for non-commercial use under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.