Napkin-text-analysis is a Python tool to produce statistical analysis of a text

Napkin is a Python tool to produce statistical analysis of a text.

Analysis features are:

  • Verb frequency
  • Noun frequency
  • Digit frequency
  • Label frequency, such as (person, organisation, product, location), as defined in spacy.io named entities
  • URL frequency
  • Email frequency
  • Mention frequency (everything prefixed with an @ symbol)
  • Out-Of-Vocabulary (OOV) word frequency, meaning any word outside the English dictionary
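The URL, email and mention counts in the list above can be approximated with the Python standard library alone. The following is a minimal sketch and not napkin's own implementation, which builds on spaCy; the regular expressions and the function name `surface_frequencies` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple patterns (assumptions, not napkin's rules).
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# The lookbehind rejects an "@" that follows a word character,
# so the domain part of an email address is not counted as a mention.
MENTION_RE = re.compile(r"(?<![\w.+-])@\w+")


def surface_frequencies(text):
    """Return one Counter per analysis: url, email and mention."""
    # Strip punctuation that commonly trails a URL at the end of a clause.
    urls = [u.rstrip(".,;:!?)") for u in URL_RE.findall(text)]
    return {
        "url": Counter(urls),
        "email": Counter(EMAIL_RE.findall(text)),
        "mention": Counter(MENTION_RE.findall(text)),
    }


if __name__ == "__main__":
    sample = (
        "Contact gbnewby@pglaf.org or @alice. Visit www.gutenberg.org "
        "and www.gutenberg.org/donate, thanks @alice!"
    )
    for name, counter in surface_frequencies(sample).items():
        print(name, counter.most_common())
```

Verb, noun, label and OOV counts need part-of-speech tagging, named entity recognition and a vocabulary, which is where a library such as spaCy comes in.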

Verbs and nouns are in their lemmatized form by default, but the option --verbatim keeps the original inflection.

Intermediate results are stored in a Redis database to allow the analysis of multiple text files.
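One way to picture this aggregation is a Redis sorted set per analysis, where each token's score is its number of occurrences. The sketch below is an assumption about the storage layout, not a description of napkin's code. `zincrby` and `zrevrange` are real redis-py methods; the `InMemorySortedSets` class and the helpers `record_tokens` and `top` are hypothetical names, included so that the example runs without a server.

```python
class InMemorySortedSets:
    """Stand-in exposing the two redis-py calls used below (hypothetical helper)."""

    def __init__(self):
        self._sets = {}

    def zincrby(self, name, amount, value):
        scores = self._sets.setdefault(name, {})
        scores[value] = scores.get(value, 0.0) + amount
        return scores[value]

    def zrevrange(self, name, start, end, withscores=False):
        # Highest score first; ties in reverse lexicographic order, as in Redis.
        ranked = sorted(
            self._sets.get(name, {}).items(),
            key=lambda item: (item[1], item[0]),
            reverse=True,
        )
        # Redis range ends are inclusive; only -1 ("to the end") is handled here.
        stop = None if end == -1 else end + 1
        selected = ranked[start:stop]
        return selected if withscores else [member for member, _ in selected]


def record_tokens(client, analysis, tokens):
    """Increment the counter of every token in the sorted set named `analysis`."""
    for token in tokens:
        client.zincrby(analysis, 1, token)


def top(client, analysis, limit):
    """Return the `limit` most frequent tokens with their counts."""
    return client.zrevrange(analysis, 0, limit - 1, withscores=True)


if __name__ == "__main__":
    client = InMemorySortedSets()
    record_tokens(client, "noun", ["prince", "state", "prince"])  # first file
    record_tokens(client, "noun", ["prince", "people"])  # second file, no flush
    print(top(client, "noun", 2))
```

Against a real server, `client` could be `redis.Redis(host="localhost", port=6380, decode_responses=True)`. Because the counters live outside the process, a second run that does not flush the database keeps adding to the same totals.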

Requirements

  • Python >= 3.6
  • spacy.io
  • redis (a redis server operating on port 6380 is required)
  • pycld3
  • tabulate

usage: napkin.py [-h] [-v V] [-f F] [-t T] [-s] [-o O] [-l L] [--verbatim]
                 [--no-flushdb] [--binary] [--analysis ANALYSIS]
                 [--disable-parser] [--disable-tagger]
                 [--token-span TOKEN_SPAN] [--table-format TABLE_FORMAT]

Extract statistical analysis of text

optional arguments:
  -h, --help            show this help message and exit
  -v V                  verbose output
  -f F                  file to analyse
  -t T                  maximum value for the top list (default is 100) -1 is
                        no limit
  -s                    display the overall statistics (default is False)
  -o O                  output format (default is csv), json, readable
  -l L                  language used for the analysis (default is en)
  --verbatim            Don't use the lemmatized form, use verbatim. (default
                        is the lematized form)
  --no-flushdb          Don't flush the redisdb, useful if you want to process
                        multiple files and aggregate the results. (by default
                        the redis database is flushed at each run)
  --binary              set output in binary instead of UTF-8 (default)
  --analysis ANALYSIS   Limit output to a specific analysis (verb, noun,
                        hashtag, mention, digit, url, oov, labels, punct).
                        (Default is all analysis are displayed)
  --disable-parser      disable parser component in Spacy
  --disable-tagger      disable tagger component in Spacy
  --token-span TOKEN_SPAN
                        Find the sentences where a specific token is located
  --table-format TABLE_FORMAT
                        set tabulate format (default is fancy_grid)

Generate all analysis for a given text

A sample file “The Prince, by Nicoló Machiavelli” is included to test napkin.

python3 ./bin/napkin.py -o readable -f samples/the-prince.txt -t 4

Example output:

╒══════════════════╕
│ Top 4 of verb    │
╞══════════════════╡
│ 207 occurences   │
├──────────────────┤
│ will             │
├──────────────────┤
│ 137 occurences   │
├──────────────────┤
│ can              │
├──────────────────┤
│ 116 occurences   │
├──────────────────┤
│ make             │
├──────────────────┤
│ 106 occurences   │
├──────────────────┤
│ may              │
├──────────────────┤
│ 102 occurences   │
├──────────────────┤
│ would            │
╘══════════════════╛
╒══════════════════╕
│ Top 4 of noun    │
╞══════════════════╡
│ 206 occurences   │
├──────────────────┤
│ prince           │
├──────────────────┤
│ 120 occurences   │
├──────────────────┤
│ man              │
├──────────────────┤
│ 108 occurences   │
├──────────────────┤
│ state            │
├──────────────────┤
│ 90 occurences    │
├──────────────────┤
│ people           │
├──────────────────┤
│ one              │
╘══════════════════╛
╒═════════════════════╕
│ Top 4 of hashtag    │
╞═════════════════════╡
╘═════════════════════╛
╒═════════════════════╕
│ Top 4 of mention    │
╞═════════════════════╡
╘═════════════════════╛
╒═══════════════════╕
│ Top 4 of digit    │
╞═══════════════════╡
│ 1 occurences      │
├───────────────────┤
│ 99775             │
├───────────────────┤
│ 84116             │
├───────────────────┤
│ 750175            │
├───────────────────┤
│ 6221541           │
├───────────────────┤
│ 57037             │
╘═══════════════════╛
╒═════════════════════════════════════════╕
│ Top 4 of url                            │
╞═════════════════════════════════════════╡
│ 5 occurences                            │
├─────────────────────────────────────────┤
│ www.gutenberg.org                       │
├─────────────────────────────────────────┤
│ 2 occurences                            │
├─────────────────────────────────────────┤
│ www.gutenberg.org/donate                │
├─────────────────────────────────────────┤
│ 1 occurences                            │
├─────────────────────────────────────────┤
│ www.gutenberg.org/license               │
├─────────────────────────────────────────┤
│ www.gutenberg.org/contact               │
├─────────────────────────────────────────┤
│ http://www.gutenberg.org/5/7/0/3/57037/ │
╘═════════════════════════════════════════╛
╒═════════════════╕
│ Top 4 of oov    │
╞═════════════════╡
│ 9 occurences    │
├─────────────────┤
│ Sforza          │
├─────────────────┤
│ 7 occurences    │
├─────────────────┤
│ Fermo           │
├─────────────────┤
│ 6 occurences    │
├─────────────────┤
│ Vitelli         │
├─────────────────┤
│ Pertinax        │
├─────────────────┤
│ Orsinis         │
╘═════════════════╛
╒════════════════════╕
│ Top 4 of labels    │
╞════════════════════╡
│ 339 occurences     │
├────────────────────┤
│ PERSON             │
├────────────────────┤
│ 305 occurences     │
├────────────────────┤
│ GPE                │
├────────────────────┤
│ 197 occurences     │
├────────────────────┤
│ CARDINAL           │
├────────────────────┤
│ 189 occurences     │
├────────────────────┤
│ ORG                │
├────────────────────┤
│ 131 occurences     │
├────────────────────┤
│ NORP               │
╘════════════════════╛
╒═══════════════════╕
│ Top 4 of punct    │
╞═══════════════════╡
│ 3440 occurences   │
├───────────────────┤
├───────────────────┤
│ 144 occurences    │
├───────────────────┤
├───────────────────┤
│ 32 occurences     │
├───────────────────┤
├───────────────────┤
│ 26 occurences     │
├───────────────────┤
├───────────────────┤
│ 11 occurences     │
├───────────────────┤
│ 1.F.3             │
╘═══════════════════╛
╒═══════════════════╕
│ Top 4 of email    │
╞═══════════════════╡
│ 1 occurences      │
├───────────────────┤
│ gbnewby@pglaf.org │
╘═══════════════════╛

Extract the sentences associated with a specific token

python3 ./bin/napkin.py -o readable -f samples/the-prince.txt -t 4 --token-span "Vitelli"

╒═════════════════════════════════════════════════════════════════════════╕
│ Top 4 of span for Vitelli                                               │
╞═════════════════════════════════════════════════════════════════════════╡
│ 1 occurences                                                            │
├─────────────────────────────────────────────────────────────────────────┤
│ This duke entered                                                       │
│ Romagna with auxiliary troops, leading forces composed entirely of      │
│ French soldiers, and with these he took Imola and Forli; but as they    │
│ seemed unsafe, he had recourse to mercenaries, and hired the Orsini and │
│ Vitelli; afterwards finding these uncertain to handle, unfaithful and   │
│ dangerous, he suppressed them, and relied upon his own men.             │
├─────────────────────────────────────────────────────────────────────────┤
│ The Florentines appointed Paolo Vitelli their captain,                  │
│ a man of great prudence, who had risen from a private station to the    │
│ highest reputation.                                                     │
├─────────────────────────────────────────────────────────────────────────┤
│ On the other hand, Messer Niccolo Vitelli has been seen in              │
│ our own time to destroy two fortresses in Città di Castello in order    │
│ to keep that state.                                                     │
├─────────────────────────────────────────────────────────────────────────┤
│ And the                                                                 │
│ difference between these forces can be easily seen if one considers     │
│ the difference between the reputation of the duke when he had only the  │
│ French, when he had the Orsini and Vitelli, and when he had to rely     │
│ on himself and his own soldiers.                                        │
├─────────────────────────────────────────────────────────────────────────┤
│ And that his foundations were                                           │
│ good is seen from the fact that the Romagna waited for him more than a  │
│ month; in Rome, although half dead, he remained secure, and although    │
│ the Baglioni, Vitelli, and Orsini entered Rome they found no followers  │
│ against him.                                                            │
╘═════════════════════════════════════════════════════════════════════════╛
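The idea behind a token span, returning every sentence in which a token appears, can be illustrated with a naive sentence splitter. This is a sketch under stated assumptions: the function name `sentences_with_token` is hypothetical, and splitting on end-of-sentence punctuation is a simplification that will mis-split abbreviations; a tool built on spaCy would normally use its sentence boundaries instead.

```python
import re

# Naive boundary: whitespace that follows ".", "!" or "?" (an assumption).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_with_token(text, token):
    """Return the sentences of `text` that contain `token` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(token) + r"\b")
    sentences = (s.strip() for s in SENTENCE_END.split(text))
    return [s for s in sentences if s and pattern.search(s)]


if __name__ == "__main__":
    sample = (
        "The Florentines appointed Paolo Vitelli their captain. "
        "He was a man of great prudence. "
        "The Baglioni, Vitelli, and Orsini entered Rome!"
    )
    for sentence in sentences_with_token(sample, "Vitelli"):
        print(sentence)
```

The word-boundary match keeps "Vitelli" from matching inside a longer word, which matters when the same stem appears in several forms.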

Specify table format

In readable output, the format can be set to any of the formats supported by tabulate. For example, to get the top 10 out-of-vocabulary
words from the text in GitHub markdown format:

python3 ./bin/napkin.py -o readable -f samples/the-prince.txt -t 10 --analysis oov --table-format github

| Top 10 of oov   |
|-----------------|
| 9 occurences    |
| Sforza          |
| 7 occurences    |
| Fermo           |
| 6 occurences    |
| Vitelli         |
| Pertinax        |
| Orsinis         |
| Colonnas        |
| Bentivogli      |
| Agathocles      |
| 5 occurences    |
| Oliverotto      |
| Cæsar           |
| Commodus        |

overview of processing in napkin

The name ‘napkin’ came after a first sketch of the idea on a napkin. The goal was also to provide a simple text analysis tool which can be run on the corner of a table in a kitchen.

napkin is free software under the AGPLv3 license.

Copyright (C) 2020 Alexandre Dulaunoy
Copyright (C) 2020 Pauline Bourmeau
