The Malagasy-language Wiktionary is full of errors
Audit of Malagasy Wiktionary
Bot-Jagwar is a bot account run by Jagwar. At mg.wikt, it has made 22,828,226 edits (and counting), catapulting mg.wikt to be the second-largest Wiktionary, with a total of 6,103,961 entries (and counting). (Note that as bot edits are continuing, all these numbers will soon be outdated.) Jagwar has a secondary bot account, Bot-Jagwar II, which has only made 6,976 edits. Another major bot contributing to mg.wikt, making the exact same kind of edit, is Ikotobaity, with 2,456,748 edits (run by Lohataona until 2017; now inactive). These three bots have created 6,076,769 new mainspace pages (and counting), which is 99.23% of all mainspace pages on mg.wikt. (Jagwar also ran bot edits on his main account, so the true number of bot-created entries is about 50,000 higher.)
In this blog post, he details the history of his bot and mg.wikt. He uses NLP and automatic translation in order to generate new entries, without any human intervention or oversight. To quote Jagwar himself: “But as time passes lots of pages get created, and even with a low rate of error, you end up with thousands of pages of potentially wrong information.” (emphasis not mine) So he knows these entries are wrong, but simply does not care.
The reason that no action has been taken at mg.wikt is that Jagwar is the only admin who has made edits, and there is no active editing community. Jagwar himself has only made 6 edits in the last 90 days, of which only 3 were in mainspace. Even an editing community the size of the largest Wiktionary, en.wikt, would not be able to clean up after these bots by hand.
Problems with non-Malagasy entries on mg.wikt
Of the 4,953,779 (and counting) non-Malagasy entries on mg.wikt, the vast majority were created by these bots based on automated translation from other Wiktionaries, chiefly en.wikt and fr.wikt. These translations can be faulty in various ways. Some of them have nearly correct definitions, but are missing crucial lexicographical information, which makes the entry as a whole misleading, e.g. mg:wikt:nigger is translated as mainty, which is an adjective that simply means “black”; this is obviously problematic coverage of a highly offensive word. Others are wrong because only one part of the entry is translated, e.g. mg:wikt:cirugía plástica (Spanish for “plastic surgery”) is translated as fandidiana, which just means “surgery”. Still others are wrong because the entry was parsed incorrectly, e.g. mg:wikt:match#Espaniola (Spanish for “match”, as in a sporting match) is translated as mahaleo, afokasoka, which is nonsensical: the first word means “to be equal (to), to compare” and the second “match [device used to light a fire]”. Here the bot was trying to hedge its bets by giving multiple, mutually exclusive interpretations of what English “match” might mean, and yet both are wrong! Many others are not wildly wrong, but still useless, e.g. mg:wikt:duniani (Swahili locative form meaning “in/on the world”) is translated as giloby, which means “globe”.
Inflected forms of words in non-Malagasy languages were bot-created for various languages, including Spanish. Many of these are generally correct in their content, but the presentation is misleading at best; at mg:wikt:afilan, two definitions are given, but one points to the suffix -ar instead of the word itself, and the other uses the English word “default” in the definition, incorrectly. However, a significant share of non-lemma entries appear to be wrong, due to odd bot errors, e.g. mg:wikt:consorcíate, which tries to link to an obviously wrong entry “sense=affirmative”. There are 24,953 entries linking to “formal=n”, 17,847 entries linking to “formal=y”, 23,337 entries linking to “person=1”, and many thousands more with similar errors.
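Links such as “sense=affirmative” and “formal=n” look like template parameters that leaked into link targets, which makes this class of error easy to flag mechanically. The following is a minimal sketch, not the method used for the counts above: it assumes the wikitext of an entry is already available as a string, and the function name, the pattern, and the sample text are all made up for illustration.

```python
import re

# Assumption: a leaked parameter shows up as a wikilink whose target has the
# shape "key=value", e.g. [[formal=n]]. An optional "|label" part is tolerated.
LEAKED_PARAM_LINK = re.compile(
    r"\[\[\s*([a-z_]+)=([^\[\]|]*?)\s*(?:\|[^\[\]]*)?\]\]"
)


def find_leaked_parameter_links(wikitext: str) -> list[tuple[str, str]]:
    """Return (key, value) pairs for every wikilink that looks like key=value."""
    return [(m.group(1), m.group(2)) for m in LEAKED_PARAM_LINK.finditer(wikitext)]


# Hypothetical wikitext, invented for this example (not copied from mg.wikt).
sample = "# [[sense=affirmative]] [[formal=n]] [[consorciar]]"
print(find_leaked_parameter_links(sample))
```

Ordinary links such as [[consorciar]] are ignored because their target contains no “=”. A real scan would also need to skip legitimate page titles that happen to contain “=”.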
Some entries were not created based on other Wiktionaries, but apparently based on dictionary entries, causing strange errors like mg:wikt:singing broken-down sakalava accompany the drum, which is claimed to be a word in French (!).
When an entry on another Wiktionary is deleted, renamed, or corrected, the copy of it made on mg.wikt is never changed, resulting in another source of error, albeit likely a much smaller one. For example, the Kinyarwanda section on mg:wikt:bogobogo is wrong, because it is based on fr:wikt:bogobogo, which was deleted earlier this year, but the bot had already created an entry on mg.wikt.
Relatively few of these entries are marked in any way for the reader to beware. Of the entries so marked, there are 406,725 entries marked as translated from en.wikt, 107,307 entries marked as translated from fr.wikt, and 119,294 entries from other Wiktionaries. The reason for this seems to be that this categorisation for entries needing to be verified, which is accompanied by a template that warns the reader that the entry has been translated, is a recent addition, as older translated entries lack it.
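To put “relatively few” in proportion, the three category counts can be summed and compared with the number of non-Malagasy entries given earlier. This is only a rough sketch: it assumes the three categories do not overlap and that every marked entry is a non-Malagasy entry, neither of which has been verified here.

```python
# Counts quoted in the text above.
MARKED_FROM_EN = 406_725
MARKED_FROM_FR = 107_307
MARKED_FROM_OTHER = 119_294
NON_MALAGASY_ENTRIES = 4_953_779

# Assumption: the categories are disjoint, so a plain sum is meaningful.
marked_total = MARKED_FROM_EN + MARKED_FROM_FR + MARKED_FROM_OTHER
marked_share = marked_total / NON_MALAGASY_ENTRIES

print(f"marked entries: {marked_total:,}")
print(f"share of non-Malagasy entries carrying a warning: {marked_share:.1%}")
```

Under those assumptions, roughly one non-Malagasy entry in eight carries any warning for the reader.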
Quantifying the rate of error
Only a careful inspection can reveal the extent of errors, which is not possible for all the millions of entries on mg.wikt. I assessed a random subsample of 100 pages with at least one non-Malagasy lemma entry. The full list of entries with their assessments, including details on any problems, is at Small wiki audit/Malagasy Wiktionary/100. I found that 49/100 were essentially unusable, as they had serious errors or omissions. A further 29/100 were only partially usable, due to significant omissions that did not rise to the level of being outright errors. Only 22/100 appear to be fully correct and usable, of which 2 are doubtful and included to be generous. Assuming this is a representative subsample, as there is no reason not to do so, this means that around half of all non-Malagasy lemma entries are wrong, and only around a fifth are fully usable (and even many of these have minor errors!). This kind of consistently low quality would be grounds for blocking if done by a human editor on any Wiktionary.
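A sample of 100 pages leaves some uncertainty around these proportions. The sketch below estimates it with a Wilson score interval; this is an added illustration, not part of the original assessment, and it assumes the 100 pages are a simple random sample of all pages with at least one non-Malagasy lemma entry.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


SAMPLE_SIZE = 100
# Counts taken from the assessment described above.
for label, count in [("unusable", 49), ("partially usable", 29), ("fully usable", 22)]:
    low, high = wilson_interval(count, SAMPLE_SIZE)
    print(f"{label}: {count}/{SAMPLE_SIZE}, roughly {low:.0%} to {high:.0%}")
```

Even at the optimistic end of the interval for unusable entries, the error rate stays far above what any Wiktionary would tolerate from a human editor.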
Problems with Malagasy entries on mg.wikt
There are 41,902 entries categorised as lacking any definition, most of which appear to be Malagasy entries, and around 30,000 of which are the result of the definitions being removed due to copyright violation years ago. Although there are 1,150,182 Malagasy entries in total, most of these are inflected forms, which can generally be safely created by bots. These definitionless entries are not strictly speaking wrong, but a definition is the most central feature of a dictionary, so these entries fail to be a useful part of the dictionary as a whole.
Moreover, there are 6,319 effectively definitionless Malagasy entries not counted in the table below, like mg:wikt:matoanteny, where the word to be defined is given as the definition, instead of giving an actual definition, or mg:wikt:mahaketrona, where the definition is blank. Some cases, like mg:wikt:tamboho, have two identically duplicated Malagasy sections, each of which simply gives the word to be defined (listed twice) as the definition. This kind of entry is not even categorised as needing a definition, but is equally useless as a dictionary entry, and the duplication of sections reflects the bots’ inability to take account of existing Wiktionary formatting.
The bot-added translation sections in Malagasy entries are also largely wrong. For example, mg:wikt:ny#Malagasy means “the”, but among the translations given are “so, her, him, them” for English, “Net” for Afrikaans, “Herodianus, sarawakensis, bogotensis, beijingensis, herous, colon, parasceve” for Latin, “orchestra, banana, ataraxia” for Romanian, and many, many more examples of absurd mistranslation on that one entry alone.
A fluent Malagasy speaker was consulted in order to assess the correctness and grammaticality of the Malagasy used in definitions. He concurred with the general problems identified here, and noted that some Malagasy entries, like mg:wikt:ady fom-pananana, are defined with incomplete sentences. In regard to both Malagasy and non-Malagasy entries, he stated that they are “hit and miss on whether or not the information is useful or not”, without assessing the accuracy of the information. In addition to these content issues, some bot-created Malagasy entries, like mg:wikt:navadika, may have correct content but are so misformatted that they are hardly recognisable as Wiktionary entries.
Quantifying the rate of definitionless entries
There are at least 47,379 definitionless Malagasy entries in total (along with 24,626 definitionless non-Malagasy entries). This total and the table below do not include about 1,423 Malagasy entries of a type exemplified by mg:wikt:ambaratonga, where the definitions are circular and thus the dictionary gives synonyms, but the entries themselves are effectively definitionless.
|Part of Speech||Entries||No Definition||Notes|
|Proper nouns||1,187||3||All of these are defined as “name of person”, “name of place”, and so on.|
|Roots||304||0||All defined as root forms of verbs.|
So far, no external action has been taken: despite discussions, Jagwar continues to run his bot without consequences. To quote him, “But this mass-adding content, especially in language I didn’t speak at all, seemed to annoy people who have decided to talk about the case on MetaWiki forum. No concluding results was given, and things were as they were before.” We need to change this.
I strongly recommend that all non-Malagasy entries created on mg.wikt by Bot-Jagwar, Bot-Jagwar II, Ikotobaity, and Jagwar’s bot run under his personal account be deleted, and that all the translation sections in Malagasy entries be removed. I further strongly recommend that the owners of these bots, Jagwar and Lohataona, be warned not to use them to create more entries at any Wiktionary ever again, or else the bots will be globally blocked.
I weakly recommend that all definitionless Malagasy entries on mg.wikt created by these bots be deleted. These are not actively harming the dictionary in the same way as wrong content, but they are reducing the signal-to-noise ratio and usefulness of the dictionary.
Further work: Problems at other Wiktionaries
Jagwar ran his bot at some other language Wiktionaries, in some cases using the same automatic translations and producing questionable content that those Wiktionaries have not checked.
- 218,156 edits at chr.wikt from 2012 to 2014, nearly all unedited by other people. These populate the category chr:wikt:Category:Entry to be checked, which at present contains 185,434 entries. There are no active editors at chr.wikt.
- 127,389 edits at ku.wikt from 2012 to 2013, nearly all unedited by other people. The Malagasy entries here include a large number of verbs that are simply defined as the present tense of that very verb, thus lacking any actual definition, e.g. ku:wikt:mivoendre. Although not wrong, these are effectively undefined entries.
Edits on other Wiktionaries were mainly adding Malagasy lemmas (at fr.wikt and at en.wikt, where he used his main account to run bot edits) or adding interwiki links, so no major damage seems to have been done elsewhere. However, editors at those Wiktionaries should still be advised to look over his edits, as they contain frequent errors in definitions, part of speech assignment, and more.