Tuesday, December 8, 2015

Multilayer Network of Language

Multilayer Network of Language: a Unified
Framework for Structural Analysis of Linguistic
Subsystems
Domagoj Margan, Ana Meštrović, Sanda Martinčić-Ipšić
Department of Informatics,
University of Rijeka,
Radmile Matejčić 2, 51000 Rijeka, Croatia

http://arxiv.org/pdf/1507.08539.pdf
If we look for the most frequent word combinations, we are considering only one layer: the word layer, so to speak. Beneath it lie the layers that this approach hides: syllables and letters. Subword layers. The trick is that the frequency of correlations between syllables somehow affects the frequency of correlations between words.

At the level of common sense: if the last syllable of one word combines with the first syllable of the next in a way that is hard to pronounce, then those words will appear together, one right after the other, less often. Even though the words seem to fit each other in meaning.

Trivial, but true.

And the conclusion:

These findings reveal a variety of new and thrilling questions which will open
new paths for future research in network linguistics. Although that part, of course, is rather unlikely.

 

Discussion and Conclusion


The presented findings show that standard network measures on isolated layers
exhibit no substantial differences across layers, only slight variations between
word and subword levels. Although, if we compare the structural differences
across the examined languages there are indications of different principles in
their organization. For instance, English is characterized by higher clustering,
with the exception of the syllabic layer. The English syllabic layer has 54
components, while Croatian has 17, which is reflected in the low clustering
coefficient of English syllables. This is caused by high flectivity of Croatian,
where many words share the suffix - the last syllable, which decreases the number
of components, and increases the clustering coefficient. This observation raises
a question, which properties will the morpheme language subsystem expose during
the incorporation into a multilayer language framework?
Even a standard distribution analysis is not sufficient to take a deeper insight
into the mutual influences between subsystems of language. The (in-/out-)degree
and strength distributions of the word-level layers are overlapped due to the same
word frequencies reflected from the same data source. Therefore, the standard
approach to study the structure of linguistic networks showed no discrepancies
among layers. However, the (in-/out-) selectivity values are potentially capable
of quantifying differences, namely to show the potential of revealing the interplay
among the layers.
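
The excerpt does not restate how selectivity is computed. A minimal sketch, assuming the definition common in the language-network literature (selectivity of a node is its strength divided by its degree, taken separately for incoming and outgoing links); the function name and the toy edge weights are made up for illustration:

```python
from collections import defaultdict

def selectivity(edges):
    """edges: dict mapping (source, target) -> weight of a directed link.

    Returns {node: (in_selectivity, out_selectivity)}.
    Assumption: selectivity = strength / degree, and 0.0 when the degree is 0.
    """
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    in_str, out_str = defaultdict(float), defaultdict(float)
    nodes = set()
    for (source, target), weight in edges.items():
        nodes.update((source, target))
        out_deg[source] += 1      # one more outgoing link
        out_str[source] += weight  # outgoing strength = sum of weights
        in_deg[target] += 1
        in_str[target] += weight
    result = {}
    for node in nodes:
        sel_in = in_str[node] / in_deg[node] if in_deg[node] else 0.0
        sel_out = out_str[node] / out_deg[node] if out_deg[node] else 0.0
        result[node] = (sel_in, sel_out)
    return result

# Toy word-level layer: weights are co-occurrence counts (invented numbers).
toy_layer = {("the", "cat"): 3, ("the", "dog"): 1, ("cat", "sat"): 2}
toy_selectivity = selectivity(toy_layer)
```

Two nodes with the same strength can have very different selectivity if one spreads its weight over many neighbours, which is presumably why the measure separates layers that degree and strength distributions do not.
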
The inter layer degree and strength correlations suggest that CO-SHU layers
are more related than the CO-SIN, and SIN-SHU pairs, due to the preserving
Zipf’s law during shuffling [31] (reflecting the utilization of the same data source).
In-distributions for syntax layers in both languages have higher values than the
corresponding out-distributions, and generally SIN is less inter correlated than
the CO and SHU layers. The inter and intra layer correlations in the multilayer
language network suggest the manifestation of different governing principles in
the syntax structure of the examined languages. The interesting part is that this
is the first observable indication of differences between languages manifested
in a multilayer analysis framework, which encouraged a deeper investigation. In
addition, the selectivity distributions (regardless of side or layer or language) are
not correlated, supporting the potential of selectivity as a measure capable to
quantify structural differences across language subsystems. Moreover, Croatian
exhibits higher correlations than English in general.
The examination of the word-level layers overlap reveals additional insights
into the mutual interplay between the layers. The weighted overlap provides a
thorough insight into the intersection of links between network layers. It seems
that WO is more appropriate to approximate the overlaps of layers in weighted
networks than the commonly employed Jaccard measure. As expected, CO-SIN
layers are more overlapped than shuffled pairs, and Croatian syntax is better
captured through words co-occurrences than the English. The preserved weights
on intersected links indicate that around 10% of the co-occurrence frequencies
are not consistent with overlapped syntax dependencies. The proposed measure
of preserved weighted overlap seems adequate to quantify the similarity of word-
level layers in weighted and directed multilayer networks of language.
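
The excerpt names the paper's weighted overlap (WO) but does not give its formula, so it is not reproduced here. As a rough illustration of why link weights matter, the sketch below compares the plain Jaccard overlap of two layers' link sets with a generic weighted Jaccard (sum of minimum weights over sum of maximum weights). The weighted variant is a stand-in chosen for this example, not the paper's WO, and the layer contents are invented:

```python
def jaccard_overlap(layer_a, layer_b):
    """Share of links present in both layers among links present in either."""
    links_a, links_b = set(layer_a), set(layer_b)
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0

def weighted_jaccard(layer_a, layer_b):
    """Generic weighted Jaccard: sum(min weights) / sum(max weights).

    A stand-in for illustration only; NOT the WO measure from the paper.
    """
    links = set(layer_a) | set(layer_b)
    shared = sum(min(layer_a.get(l, 0), layer_b.get(l, 0)) for l in links)
    total = sum(max(layer_a.get(l, 0), layer_b.get(l, 0)) for l in links)
    return shared / total if total else 0.0

# Toy layers: (source, target) -> weight, with invented numbers.
co_layer = {("a", "b"): 4, ("b", "c"): 2, ("c", "d"): 1}
sin_layer = {("a", "b"): 2, ("b", "c"): 2, ("a", "c"): 1}

plain = jaccard_overlap(co_layer, sin_layer)
weighted = weighted_jaccard(co_layer, sin_layer)
```

The plain measure only asks whether a link exists in both layers; any weighted measure also penalizes links that exist in both but carry very different frequencies.
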
The subword layer’s analysis reveals that the syllabic layer plays an important
role in the manifestation of principles governing the construction of word layer,
which is different for the examined languages. The graphemic layers, on the
other hand, share characteristics, which are reflections of the high density of the
graphemic networks (almost complete graphs in both languages).
The obtained multilayered language analysis results manifest different driving
principles beneath the co-occurrence, shuffled, syntactic, syllabic and graphemic
layers, which was not obvious through the analysis of isolated layers. In order
to obtain deeper insight into these relations we utilize the analysis of motifs,
which reveal a close topological structure in the syntactic and syllabic layers of
both languages. The correlations of the motifs’ frequencies are more emphasized
in Croatian. The triad significance profiles (TSP) are correlated between syntax
and syllables regardless of the language, while English additionally exhibits a
correlation between co-occurrence and syntax layers. It seems that the observed
TSP correlations reflect the properties of the Croatian - the free word-order
which caused different characterizations of the co-occurrence and syntax layers.
Moreover, the high flectivity of Croatian is reflected in many suffixes realized
by syllables. Therefore, the structure of layers also reflects the morphological
properties inherent to the language, which we should examine more deeply in
the future.
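
For readers unfamiliar with triad significance profiles, here is a minimal sketch of the usual construction (z-score of each triad count against randomized networks, then normalization of the z-score vector to unit length). It assumes the triad counts have already been obtained elsewhere; the function name, the use of the population standard deviation, and the numbers are illustrative assumptions, not taken from the paper:

```python
import math
from statistics import mean, pstdev

def triad_significance_profile(real_counts, random_counts):
    """real_counts: triad counts per subgraph type in the observed layer.
    random_counts: list of count vectors from randomized versions of it.

    Returns the z-score vector normalized to unit length.
    """
    z_scores = []
    for i, observed in enumerate(real_counts):
        sample = [counts[i] for counts in random_counts]
        spread = pstdev(sample)
        # Zero spread gives no basis for a z-score; use 0.0 (assumption).
        z_scores.append((observed - mean(sample)) / spread if spread else 0.0)
    norm = math.sqrt(sum(z * z for z in z_scores))
    return [z / norm for z in z_scores] if norm else z_scores

# Two triad types, two randomized networks (invented counts).
profile = triad_significance_profile([10, 5], [[8, 5], [6, 7]])
```

Because each profile is normalized, profiles from layers of very different size (for example syntax versus syllables) can be compared directly, typically with a correlation coefficient.
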
Our findings are in line with previous observations in language networks
research. For instance, Ferrer i Cancho [35] reports that the amount of
syntactically incorrect links in co-occurrence networks can increase to a high of 70%,
and elaborates: "About 90% of syntactic relationships take place at a distance
lower or equal than two, but word co-occurrence networks lack a linguistically
precise definition of link and fail in capturing the characteristic long-distance
correlations of words in sentences." This adequately explains the driving
principle of the CO-SIN relationships which we have confirmed in this research. Still,
an explanation of the linguistic grounding for the SIN-SYL relationships remains
an open challenge.
Our results strongly suggest that there are some properties which are inherent
in the word-level layers and not for the subword layers; while some are inherent
in the word-subword relations. More precisely, it seems that syntax and syllables
exhibit influences of the same linguistic phenomena.
Conclusion. In this research we use the multilayer networks framework to explore
various language subsystems interactions. Multilayer networks are constructed
from five variations of the same original text: three on the word-level (syntax,
co-occurrence and its shuffled counterpart) and two on the subword level (syllables
and graphemes). The analysis and comparison of layers at word and subword
levels is employed in order to determine the mechanism of mutual interactions
between different linguistic units.
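
A minimal sketch of how three of the five layers could be built from one tokenized text. The details are assumptions made for illustration: links are directed, join adjacent units, and are weighted by frequency; the syntax and syllabic layers are omitted because they need a dependency parser and a syllabifier, which the excerpt does not specify:

```python
import random
from collections import Counter

def cooccurrence_layer(tokens):
    """Directed links between adjacent words, weighted by frequency."""
    return Counter(zip(tokens, tokens[1:]))

def shuffled_layer(tokens, seed=0):
    """Co-occurrence layer of the same words in random order.

    Shuffling keeps word frequencies but destroys word order.
    """
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return cooccurrence_layer(shuffled)

def grapheme_layer(tokens):
    """Directed links between adjacent letters inside each word."""
    links = Counter()
    for word in tokens:
        links.update(zip(word, word[1:]))
    return links

# Toy text (invented); a real run would use a full corpus.
tokens = "the cat saw the cat".split()
co = cooccurrence_layer(tokens)
shu = shuffled_layer(tokens)
gra = grapheme_layer(tokens)
```

All layers come from the same token sequence, which matches the excerpt's point that the word-level layers share one data source and therefore the same word frequencies.
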
The presented findings corroborate that the multilayer framework can meet
the demands in expressing the complex structure of language. According to these
results one can notice substantial differences between the networks’ structures of
different language layers, which are hidden during the exploration of an isolated
layer, regardless of modeled language (e.g. Croatian or English). Therefore, it is
important to include all language layers simultaneously in order to capture all
language characteristics in the systematic exploration.
The multilayer network framework is a powerful, consistent and systematic
approach to model several linguistic subsystems simultaneously and to provide
a more general view on language. The word-level layers can be represented as
multiplex networks (the coupled links have 1:1 or 0:1 inter-connections), while
the connections between word and subword layers are not coupled (have N:M
inter-connections). Hence, defining the unified theoretical model for the
multilayer language networks is essential for further endeavors in the research of
linguistic networks.
These findings reveal a variety of new and thrilling questions which will open
new paths for future research in network linguistics. To conclude, we are at
the very beginning of an exciting and challenging pursuit. Hence, our future
research plans involve: exploring the relationships of other languages’ subsystems
(i.e. morphological, phonetic), defining the theoretical model capable of
capturing all structural variations of language subsystems’ relationships and eventually
explaining the governing principle of mutual interactions and conceptual
universalities in natural languages.

Memes & media viruses
