JRC-Names: A highly multilingual named entity resource
Release version 1.0: JRC-Names was released in September 2011.
Find out about further
multilingual resources distributed via the JRC.
JRC-Names is a highly multilingual named entity resource for person and
organisation names (called 'entities'). It consists of large lists
of names and their many spelling variants (up to hundreds for
a single person), including variants across scripts (Latin, Greek,
Arabic, Cyrillic, Japanese, Chinese, etc.). The named entity resource
file with the list of spelling variants is accompanied by demonstrator
software, implemented in Java, that (a) produces a list of known
spelling variants for any input name, and (b) analyses UTF-8-encoded
text files to find known entity mentions, returning the name variant
found, the preferred display name for that entity, the unique name
identifier, the position of the entity name in the text, and its
length in characters.
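The text analysis function described above can be illustrated with a
minimal sketch. It does not reproduce the JRC demonstrator: the class
NameMatcher, the nested Mention and Entity types, the entity
identifier values and the naive substring search are all assumptions
made for illustration. In particular, the sketch ignores word
boundaries, and it counts positions and lengths in Java's UTF-16 code
units.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not the JRC demonstrator: naive lookup of known name variants. */
class NameMatcher {

    /** A known entity: identifier plus preferred display name. */
    static final class Entity {
        final int id;
        final String displayName;

        Entity(int id, String displayName) {
            this.id = id;
            this.displayName = displayName;
        }
    }

    /** One mention of a known entity found in a text. */
    static final class Mention {
        final String variant;      // spelling as found in the text
        final String displayName;  // preferred display name of the entity
        final int entityId;        // identifier of the entity (made-up values here)
        final int position;        // start offset in the text (UTF-16 code units)
        final int length;          // length of the variant (UTF-16 code units)

        Mention(String variant, Entity entity, int position) {
            this.variant = variant;
            this.displayName = entity.displayName;
            this.entityId = entity.id;
            this.position = position;
            this.length = variant.length();
        }
    }

    // Maps each known spelling to the entity it belongs to.
    private final Map<String, Entity> variants = new LinkedHashMap<>();

    /** Registers one spelling of an entity; empty spellings are ignored. */
    void addVariant(String variant, int entityId, String displayName) {
        if (variant.isEmpty()) {
            return;
        }
        variants.put(variant, new Entity(entityId, displayName));
    }

    /** Returns every occurrence of every known spelling, ordered by position. */
    List<Mention> find(String text) {
        List<Mention> found = new ArrayList<>();
        for (Map.Entry<String, Entity> e : variants.entrySet()) {
            String v = e.getKey();
            for (int i = text.indexOf(v); i >= 0; i = text.indexOf(v, i + 1)) {
                found.add(new Mention(v, e.getValue(), i));
            }
        }
        found.sort((a, b) -> Integer.compare(a.position, b.position));
        return found;
    }

    public static void main(String[] args) {
        NameMatcher matcher = new NameMatcher();
        matcher.addVariant("Muammar Gaddafi", 1, "Muammar Gaddafi");
        matcher.addVariant("Moammar Kadhafi", 1, "Muammar Gaddafi");
        for (Mention m : matcher.find("Talks with Moammar Kadhafi resumed.")) {
            System.out.println(m.variant + " -> " + m.displayName
                    + " (id " + m.entityId + ", position " + m.position
                    + ", length " + m.length + ")");
        }
    }
}
```

A linear scan per spelling is only practical for small lists. With
hundreds of thousands of names, a trie or a similar multi-pattern
matching structure would be the more likely choice.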
For examples, visit any of the more than one million entity pages
on EMM-NewsExplorer (e.g. that for the United Nations), each of which
shows the list of spelling variants automatically collected for that
entity.
[Figure: known spelling variants for the person name Muammar Gaddafi]
The data release by the Joint Research Centre (JRC)
is in line with the general effort of the European Commission to
support multilingualism, language diversity and the re-use of Commission
information.
JRC-Names is a technical resource that can be used to
find names even if they are spelled differently, but it is also
a useful ingredient for IT systems that process text, e.g. for
text mining. The tool serves many purposes and addresses various
problems, including the following:
- Proper names are a problem when searching databases, the internet
  and other repositories, because variants of searched names are often
  not found. This results in suboptimal use and exploitation of
  repositories for documents, images and audio-visual content.
  JRC-Names allows names to be standardised, thus improving retrieval;
- Names are a known problem for machine translation, as they should
  not be translated like other words; names can be extracted before
  the translation process and the foreign language variant can
  be re-inserted in the target language to solve this problem;
- Lists of names in two different scripts are often used to
  learn transliteration rules;
- Names can be recognised and marked up in text to use as seeds when
  training a machine learning named entity recognition system;
- Social networks are less biased by national viewpoints if produced
  using multi-national sources and entity lists;
- Recognition of names is useful as input to the computational
  linguistics tasks of opinion mining, co-reference resolution,
  summarisation, topic detection and tracking, cross-lingual linking
  of related documents across languages, and more.
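The machine translation use case in the list above (extract names
before translation, re-insert the target language variant afterwards)
can be sketched as follows. This is an illustration under stated
assumptions, not part of the JRC software: the class NameMasking, the
placeholder format __NE0__ and the example spellings are hypothetical,
and the translation step itself is left out.

```java
import java.util.Map;

/** Hypothetical sketch: protect known names from a translation step. */
class NameMasking {

    /**
     * Replaces each known source-language spelling found in the text with a
     * placeholder, and records in 'slots' which target-language spelling
     * belongs to each placeholder.
     */
    static String mask(String text, Map<String, String> sourceToTarget,
                       Map<String, String> slots) {
        int next = slots.size();
        for (Map.Entry<String, String> e : sourceToTarget.entrySet()) {
            if (e.getKey().isEmpty() || !text.contains(e.getKey())) {
                continue;
            }
            String placeholder = "__NE" + next++ + "__";
            text = text.replace(e.getKey(), placeholder);
            slots.put(placeholder, e.getValue());
        }
        return text;
    }

    /** Re-inserts the target-language spellings into the translated text. */
    static String unmask(String translated, Map<String, String> slots) {
        for (Map.Entry<String, String> e : slots.entrySet()) {
            translated = translated.replace(e.getKey(), e.getValue());
        }
        return translated;
    }
}
```

In use, mask would be called on the source sentence, the masked
sentence passed to the translation system, and unmask called on its
output. The sketch assumes that the translation system leaves the
placeholders untouched.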
JRC-Names is a by-product of the analysis of about 100,000 news
reports per day by the Europe
Media Monitor (EMM) family of applications.
It was mostly compiled automatically, by analysing hundreds of
millions of news articles since the year 2004 in up to twenty
languages, identifying names of entities (mostly persons, but
also organisations, event names, and more), and detecting which
of these newly found names are variant spellings of each other.
Most name variants in JRC-Names are thus spellings that were found
in real-life text (including frequent spelling mistakes). Additionally,
for a subset of the collection of entities, software automatically
extracted spelling variants in many further languages (e.g. Chinese,
Thai, Japanese, ...) from the cross-lingual links in Wikipedia.
For highly frequent or otherwise important names, the named entity
resource was additionally manually verified. As JRC-Names was
mostly produced automatically, it will contain some errors.
For details, you can read the publication JRC-Names:
A freely available, highly multilingual named entity resource.
JRC-Names contains the most important names of the EMM name database,
i.e. those names that were found frequently or that were verified
manually or found on Wikipedia.
The first release of JRC-Names (September 2011) contains the
names of about 205,000 distinct known entities, plus about the
same number of variant spellings for these entities. Additionally,
it contains a number of morphologically inflected variants of
these names. The resource grows by about 230 new entities and
430 new name variants per week (status July 2011).
EMM identifies new names every day, and a file that also includes
the most recently found names and name spellings is available
for daily download from the JRC's web pages.
As of July 2011, the database included names spelt in 27
different scripts. The most frequently used scripts are
Latin (including English and most other European languages), Cyrillic
(e.g. Russian and Bulgarian), Arabic (including Farsi), Japanese
(Han, Hiragana and Katakana) and Chinese Han (simplified variant).
64% of the names in JRC-Names do not have additional spelling
variants. For 28% of the names, JRC-Names knows two or three spellings.
There are 3,760 entities with ten spellings or more, and 37 entities
with over 100 spelling variants. The names with the most spelling
variants are Muammar Gaddafi (413 spellings), Mikhail Saakashvili (256)
and Mahmoud Ahmadinejad (246) (status July 2011).
A description
of JRC-Names (version 1) was published in the paper below. Please
use this publication as a reference when you refer to JRC-Names.
You may want to check the web site http://langtech.jrc.ec.europa.eu
in the future for more up-to-date publications on the subject.
Steinberger, Ralf, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva &
Erik van der Goot (2011). JRC-Names: A freely available, highly
multilingual named entity resource. Proceedings of the 8th
International Conference Recent Advances in Natural Language Processing
(RANLP). Hissar, Bulgaria, 12-14 September 2011.
By downloading and/or using JRC-Names,
you agree to the usage conditions formulated in
the licence,
which is available at http://langtech.jrc.ec.europa.eu/Resources/LICENCE-EULA_JRC-Names_2011.pdf.
Depending on your needs, you may want to download part or all of
the following components:
- JRC-Names Java demonstrator code: This .jar file analyses
  UTF-8-encoded text files to recognise known named entities. It
  can also generate a list of all known variants for any input
  name. It needs to be used in combination with the entity
  resource file.
- JRC-Names named entity resource file: This file contains the list of
  names and their variants. It is planned that this file will be
  updated daily in order to include the most recently added entity
  names. (filename: entities.gzip; zipped size: ca. 4 MB; unzipped:
  ca. 13 MB).
- JRC-Names Java source code: You only need this if you want to
  integrate the resource into your own environment.
- JRC-Names documentation: This is the documentation for the Java
  software.
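For readers who want to work with the resource file directly, the
following minimal sketch shows how a gzip-compressed, UTF-8-encoded
list might be read in Java and used to list all known variants of an
input name. The line layout assumed here (entity identifier, a tab,
then one spelling) is a hypothetical simplification and may differ
from the actual file, so please consult the documentation for the
real layout. The class VariantIndex is likewise an illustration, not
part of the JRC code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical sketch: index of spelling variants read from a gzip stream. */
class VariantIndex {

    private final Map<String, String> idByName = new HashMap<>();
    private final Map<String, List<String>> namesById = new HashMap<>();

    /**
     * Reads gzip-compressed UTF-8 lines of the ASSUMED form "id<TAB>name".
     * Lines that do not match this form are skipped.
     */
    static VariantIndex load(InputStream gzipped) throws IOException {
        VariantIndex index = new VariantIndex();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t", 2);
                if (cols.length < 2 || cols[1].isEmpty()) {
                    continue;
                }
                index.idByName.put(cols[1], cols[0]);
                index.namesById
                        .computeIfAbsent(cols[0], k -> new ArrayList<>())
                        .add(cols[1]);
            }
        }
        return index;
    }

    /** Returns all spellings of the entity that 'name' belongs to, or an empty list. */
    List<String> variantsOf(String name) {
        String id = idByName.get(name);
        if (id == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(namesById.get(id));
    }
}
```

With a downloaded file, the index could be built with
VariantIndex.load(Files.newInputStream(Paths.get("entities.gzip"))),
using java.nio.file.Files and java.nio.file.Paths.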