Linguistic Resources and Tools
On the Language Technology-related
web site of the Joint Research Centre (JRC), you can find
several resources for download. This page gives you an overview
of what is available. Please adhere to the usage conditions of each
of the resources, if applicable.
The data releases are in line
with the general effort of the European Commission to support multilingualism,
language diversity and the re-use of Commission information.
JRC-Acquis
The JRC-Acquis
is a multilingual sentence-aligned parallel corpus in 22 languages,
containing a total of over 1 billion words. This collection of documents
and their manually produced translations can be used for many purposes,
including the training of statistical machine translation systems,
the training and testing of text mining applications, and more.
Details
on this resource can be found at the URL: http://langtech.jrc.ec.europa.eu/JRC-Acquis.html.
Languages: Bulgarian, Czech, Danish,
Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian,
Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian,
Slovak, Slovene, Spanish, Swedish.
Date of first release: May 2006.
DGT-Translation Memory (DGT-TM)
DGT-TM
is a Translation Memory of the Acquis Communautaire, i.e. the body
of European legislation, including all the treaties, regulations
and directives adopted by the European Union (EU) and the rulings
of the European Court of Justice. Translation memories are collections
of small pieces of text and their manually produced translations.
Translation memories are typically used to support human translators,
but they can also be used to train statistical machine translation
systems. DGT-TM consists of up to 2 million units per language.
It is distributed in the widely used TMX format. More
details can be found at the URL: http://langtech.jrc.ec.europa.eu/DGT-TM.html.
Languages: Bulgarian, Czech, Danish,
Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian,
Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian,
Slovak, Slovene, Spanish, Swedish.
Date of release: November 2007.
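As an illustration of the TMX format mentioned above, the following minimal Python sketch reads translation units from a TMX document using only the standard library. It assumes the usual TMX layout (tu elements containing tuv elements that carry a language attribute and a seg child). The sample content, the language codes and the function names read_tmx_units and language_pairs are made up for this example and are not taken from DGT-TM itself.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree exposes it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written TMX sample (hypothetical content, not real DGT-TM data).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="fr"><seg>Le présent règlement entre en vigueur.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Done at Brussels.</seg></tuv>
      <tuv xml:lang="de"><seg>Geschehen zu Brüssel.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_units(tmx_text):
    """Return one {language: segment text} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


def language_pairs(units, source, target):
    """Keep only the units that contain both requested languages."""
    return [(u[source], u[target]) for u in units if source in u and target in u]


units = read_tmx_units(SAMPLE_TMX)
en_fr = language_pairs(units, "en", "fr")
print(en_fr)
```

For files with millions of units, loading the whole document at once may use too much memory; ET.iterparse from the same module can process one tu element at a time instead.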
ECDC-Translation Memory (ECDC-TM)
ECDC-TM
is a Translation Memory of the web pages of the European Centre
for Disease Prevention and Control (ECDC). Most of the documents
cover health-related topics (anthrax, botulism,
cholera, dengue fever, hepatitis, etc.), but some of the web pages
also describe the ECDC itself (e.g. its organisation, job
opportunities) and its activities (e.g. epidemic intelligence, surveillance).
ECDC-TM consists
of up to 2500 translation units per language. It is distributed
in the widely used TMX format. More details can be found at the
URL: http://langtech.jrc.ec.europa.eu/ECDC-TM.html.
Languages (25): Bulgarian, Czech,
Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian,
Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish,
Turkish.
Date of release: October 2012.
Sentiment-annotated set of quotations
This is a set of 1590 English-language
quotations (reported speech) extracted automatically from
the news and annotated manually for the sentiment expressed towards
entities (persons or organisations) mentioned inside the quotation.
For each quote, the resource consists of the text found inside the
quotation markers, the speaker (the person who issued the quotation),
the entity mentioned inside the quotation, as well as two manually
produced sentiment judgements. The data is distributed as an Excel
file with three sheets: one containing important background information
(the Readme), one containing the instructions given to the
annotators, and one containing the main data. You can download the
English-language sentiment-annotated set of quotations at the URL: 2010_JRC_1590-Quotes-annotated-for-sentiment.zip.
Language: English.
Date of release: June 2010.
Multilingual summary evaluation data
This is a manually annotated collection of document clusters of
parallel texts in seven languages (Arabic, Czech, English, French,
German, Russian and Spanish) that can be used to evaluate multi-document,
or even single-document, summarisation software. The accompanying
publication (Turchi et al. (2010). Using parallel corpora for multilingual (multi-document) Summarisation Evaluation. Proceedings of CLEF'2010, Springer LNCS series) suggests that valuable annotation
time can be saved by using the sentence alignment information in this
parallel corpus to project the monolingual sentence selection annotation
across languages. The publication also proposes various ways to make use of
the varying degree of overlap between the manual annotations produced by four different
annotators. The downloadable zip file contains the full text of
all documents in seven languages, sentence-split full texts, sentence
alignment information for all language pairs involving English,
as well as the annotations of the English documents. Important background
information about the XML structure of the files can be found in
the Readme file. The four document clusters each consist of
five high-level commentaries selected from http://www.project-syndicate.org/,
on topics that can roughly be described as malaria,
the Israel-Palestine conflict, genetics, and science and society.
You can download the manually
annotated multilingual multi-document summary evaluation data
at the URL: http://langtech.jrc.ec.europa.eu/Resources/2010_JRC_multilingual-summary-evaluation.zip.
Languages: Arabic,
Czech, English, French, German, Russian and Spanish.
Date of release: September 2010.
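The idea of projecting sentence selection annotations across languages, described above, can be sketched as follows. This is a minimal illustration and not the method of the cited publication: the data structures, the sentence identifiers, the alignment table, the vote threshold and the function names are all assumptions made for this example. The actual XML layout of the files is documented in the Readme.

```python
from collections import Counter

# Hypothetical annotations: for each of four annotators, the set of English
# sentence ids selected as summary-worthy (made-up values).
annotations = {
    "annotator_1": {1, 3, 5},
    "annotator_2": {1, 3},
    "annotator_3": {1, 4, 5},
    "annotator_4": {1, 5, 6},
}

# Hypothetical sentence alignment: English sentence id -> aligned French ids.
# One English sentence may align to zero, one or several target sentences.
alignment_en_fr = {1: [1], 2: [2], 3: [3, 4], 4: [5], 5: [6], 6: []}


def vote_counts(annotations):
    """Count how many annotators selected each sentence."""
    counts = Counter()
    for selected in annotations.values():
        counts.update(selected)
    return counts


def select_by_agreement(annotations, min_votes):
    """Keep the sentences chosen by at least min_votes annotators."""
    counts = vote_counts(annotations)
    return {sid for sid, votes in counts.items() if votes >= min_votes}


def project(selected_source_ids, alignment):
    """Map selected source sentence ids to their aligned target ids."""
    projected = set()
    for sid in selected_source_ids:
        projected.update(alignment.get(sid, []))
    return projected


english_selection = select_by_agreement(annotations, min_votes=3)
french_selection = project(english_selection, alignment_en_fr)
print(sorted(english_selection), sorted(french_selection))  # prints [1, 5] [1, 6]
```

Lowering min_votes yields a larger, less strict reference selection, which is one simple way to exploit the varying overlap between annotators.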
JRC-Names - a multilingual named entity resource
JRC-Names
is a highly multilingual named entity resource for person and organisation
names that has been compiled over seven years of large-scale multilingual
news analysis combined with Wikipedia mining, resulting in 205,000
person and organisation names plus about the same number of spelling
variants written in over 20 different scripts and in many more languages.
It can be used for a number of purposes, including improving
name search in databases or on the internet, seeding machine
learning systems that learn named entity recognition rules, improving
machine translation results, and more. Details
on this resource can be found at the URL: http://langtech.jrc.ec.europa.eu/JRC-Names.html.
Languages: JRC-Names covers many
different languages, including: Arabic, Bulgarian, Chinese, Danish,
Dutch, English, Estonian, Farsi, French, Georgian, German, Greek,
Hebrew, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese,
Romanian, Russian, Slovene, Spanish, Swahili, Swedish, Thai and
Turkish.
Date of release: September 2011.
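To illustrate how a list of spelling variants can improve name search, the following minimal Python sketch maps different spellings of a name to a shared entity identifier. The entity ids, the small hand-written variant list and the function names are assumptions made for this example. They do not reflect the actual file format of JRC-Names, which is described on the resource page.

```python
import unicodedata

# Hypothetical (entity id, name variant) pairs, hand-written for illustration.
VARIANTS = [
    (1, "Angela Merkel"),
    (1, "Ангела Меркель"),
    (1, "Angela Merkelová"),
    (2, "United Nations"),
    (2, "Nations Unies"),
    (2, "Vereinte Nationen"),
]


def normalise(name):
    """Unicode-normalise, case-fold and collapse whitespace for matching."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


def build_index(variants):
    """Map each normalised variant to its entity id."""
    return {normalise(variant): entity_id for entity_id, variant in variants}


def lookup(index, name):
    """Return the entity id for a name variant, or None if it is unknown."""
    return index.get(normalise(name))


index = build_index(VARIANTS)
# Two spellings in different scripts resolve to the same entity id.
print(lookup(index, "ANGELA  MERKEL") == lookup(index, "ангела меркель"))
```

A search for any known variant can then be expanded to all spellings that share the same identifier.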
JEX - JRC EuroVoc Indexer
JEX is multi-label classification software that automatically assigns to new texts a ranked list of descriptors (classes) drawn from the more than six thousand descriptors in the controlled vocabulary of the EuroVoc thesaurus. JEX has been trained for twenty-two EU languages. The software allows users to re-train the system with their own documents, or with a combination of their own documents and the data provided together with the software. JEX can also be trained using classification schemes other than EuroVoc. Details and download links can be found at the URL: http://langtech.jrc.ec.europa.eu/Eurovoc.html.
Languages: Bulgarian, Czech, Danish,
Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian,
Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian,
Slovak, Slovene, Spanish, Swedish.
Date of release: May 2012.
More resources will be made available in
the future.