M. Kubát, R. Čech and our new doctoral student K. Pelegrinová, together with colleagues J. Hůla and D. Číž from the Institute for Research and Applications of Fuzzy Modeling, today presented research on the analysis of so-called context specificity at the international corpus conference SlaviCorp in Prague.
Context Specificity of Lemma. Diachronic analysis
The study deals with the application of neural networks in linguistic research
on word semantics. A recently proposed method of measuring the Context
Specificity of Lemma (Čech et al. 2018), based on the Word2vec word embedding
technique (Mikolov et al. 2013), is introduced and illustrated by an analysis of
selected lemmas from various fields (e.g. political discourse or IT). The research
is based on the fourth version of the SYN series corpora of the Czech National
Corpus (Hnátková et al. 2014). The results indicate that the method is applicable
to detecting the semantic development of a lemma and that it could have
potential for linguistic studies. Although neural networks are generally black-box
methods, our approach enables a linguistic interpretation of the obtained
results. The aim of this contribution is to introduce a method which can detect
semantic changes of a lemma from a diachronic viewpoint.
In word embedding methods, each lemma is represented by a vector. The
size and orientation of the vector express the position of the lemma in a
multi-dimensional semantic space. Therefore, it is possible to measure similarities
among lemmas. If, in an ideal case, two lemmas occurred in identical contexts
throughout the whole corpus, the size and orientation of their two
vectors would be identical and, thus, the distance between the two lemmas
would equal zero or, conversely, the similarity between them would equal one.
In reality, each lemma occurs in different contexts; consequently, lemmas are
represented by different vectors, which enables us to compute similarities among them.
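As a minimal sketch of the similarity computation described above, the following
Python snippet compares toy vectors. It assumes cosine similarity as the measure,
which the text does not name explicitly; the function name and the
three-dimensional vectors are purely illustrative (real Word2vec vectors
typically have hundreds of dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the corpus.
vec_a = [1.0, 2.0, 3.0]
vec_b = [1.0, 2.0, 3.0]   # ideal case: same contexts, hence the same vector
vec_c = [3.0, -1.0, 0.5]  # a lemma used in different contexts

sim_identical = cosine_similarity(vec_a, vec_b)  # approximately 1
sim_different = cosine_similarity(vec_a, vec_c)  # clearly below 1
```

Identical vectors give a similarity of (approximately) one, while vectors
pointing in different directions give a lower value.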
The Context Specificity of Lemma (CSL) method measures how unique
the context in which a lemma appears in the corpus is. Specifically, if a
lemma occurs in many different contexts, it will have low context specificity.
The context in which the lemma appears is captured by a distributed vector
representation which is assigned to every lemma. In this vector representation,
it is possible to measure the similarities among lemmas. More
specifically, for each lemma we can compute its similarity to all
other lemmas. Statistics of these similarities (e.g., the mean value) can be used
to characterize the Context Specificity of Lemma. The lower the mean of the
similarities, the higher the CSL.
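The idea can be sketched in a few lines of Python. This is only an illustration
under stated assumptions: cosine similarity is assumed as the similarity measure,
the mean is used as the summary statistic, and the lemma names and
two-dimensional vectors are hypothetical. The text does not specify how the
mean is turned into a CSL score, so the sketch only ranks lemmas by their mean
similarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_similarity(lemma, vectors):
    """Mean similarity of `lemma` to every other lemma in `vectors`."""
    others = [w for w in vectors if w != lemma]
    return sum(cosine_similarity(vectors[lemma], vectors[w]) for w in others) / len(others)

# Hypothetical toy embedding: three lemmas point in similar directions, one does not.
toy_vectors = {
    "lemma_a": [1.0, 0.1],
    "lemma_b": [0.9, 0.2],
    "lemma_c": [1.0, 0.0],
    "lemma_d": [-0.2, 1.0],
}

means = {lemma: mean_similarity(lemma, toy_vectors) for lemma in toy_vectors}

# Lowest mean similarity first, i.e. highest context specificity first.
ranking = sorted(means, key=means.get)
```

Here `lemma_d` has the lowest mean similarity to the other lemmas and therefore
comes first in `ranking`, i.e. it has the highest context specificity in this toy example.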
Neural networks need huge training data sets to produce
reliable results. We therefore decided to use one of the largest Czech corpora –
the fourth version of the SYN series corpora (Hnátková et al. 2014). The size of
SYN_V4 is 3.626 billion tokens. The SYN corpus is not representative;
its dominant component is journalism. Besides journalism, there are two other
text types: fiction and technical literature. Only journalistic texts were
selected for the analysis. The final corpus of our study consists of more than
3 billion tokens (3,045,389,630) and more than one hundred thousand types
(102,707). In order to avoid a bias caused by low frequencies, all lemmas with
a frequency lower than 70 were omitted (f ≤ 69). Since the goal is to analyse the
diachronic development of the CSL, we divided the data into 19 subcorpora,
each representing one year. Only the 1990-1996 subcorpus consists of texts
from several years, because little data is available for those years.
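The data preparation described above could look roughly like the following
Python sketch. It is not the authors' pipeline: the input format (pairs of
publication year and lemma), the function names, and the decision to count
frequencies over the whole corpus rather than per subcorpus are assumptions
made for illustration.

```python
from collections import Counter, defaultdict

MIN_FREQ = 70  # lemmas with f <= 69 are omitted, as stated in the text

def subcorpus_label(year):
    """Map a publication year to a subcorpus label; 1990-1996 are pooled."""
    return "1990-1996" if 1990 <= year <= 1996 else str(year)

def split_and_filter(records, min_freq=MIN_FREQ):
    """Group (year, lemma) pairs into subcorpora, dropping rare lemmas.

    Assumption: frequency is counted over the whole corpus, not per subcorpus.
    """
    records = list(records)
    freq = Counter(lemma for _, lemma in records)
    subcorpora = defaultdict(list)
    for year, lemma in records:
        if freq[lemma] >= min_freq:
            subcorpora[subcorpus_label(year)].append(lemma)
    return dict(subcorpora)

# Hypothetical toy data: "x" occurs 75 times in total, "y" only 69 times.
toy_records = (
    [(1992, "x")] * 40 + [(1995, "x")] * 30 + [(2001, "x")] * 5 + [(2001, "y")] * 69
)
result = split_and_filter(toy_records)
```

In this toy input, the occurrences of "x" from 1992 and 1995 end up together in
the pooled "1990-1996" subcorpus, while "y" is dropped because its frequency is below 70.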