
How to Clean Text Data at the Command Line


Photograph by JESHOOTS.COM on Unsplash

Cleaning data is like cleaning the walls in your house: you clear any scribble, remove the dust, and get rid of whatever is unnecessary and makes your walls ugly. The same thing happens when you clean your data: you keep what you want and remove what you don't, so the raw data becomes useful and is not raw anymore. You can do the cleaning with Python, R, or whatever language you prefer, but in this tutorial I'm going to explain how to clean your text files at the command line, using insights from a paper researching clickbait and non-clickbait data.

This tutorial is mainly motivated by Data Science at the Command Line.

Disclosure: The two Amazon links for the book (in this section) are paid links, so if you buy the book, I will get a small commission.

This book tries to draw your attention to the power of the command line when you do data science tasks, meaning you can obtain your data, manipulate it, explore it, and make predictions on it using the command line. If you are a data scientist, aspiring to be one, or want to know more about it, I highly recommend this book. You can read it online for free from its website or order an ebook or paperback. In this tutorial we're gonna focus on using the command line to clean our data.

Pulling and running the docker image

To remove the hassle of downloading the files we deal with and the dependencies we need, I've made a docker image for you that has all you need. You just pull it from Docker Hub and you will find everything you need to play with, so you can focus on the cleaning part. So let's pull that image and then run it interactively to enter the shell and write some commands.

$ docker pull ezzeddin/clean-data
$ docker run --rm -it ezzeddin/clean-data
  • docker run is a command to run the docker image
  • the option --rm is set to remove the container after it exits
  • the option -it, which is a combination of -i and -t, is set for an interactive process (shell)
  • ezzeddin/clean-data is the docker image name

If using docker is still unclear to you, you can see the "why we use docker" tutorial.

Cleaning text files

Let's clean two text files containing clickbait and non-clickbait headlines for 16,000 articles each. This data comes from a paper titled Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media, presented at the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). Our goal here is to get the most common words used in both clickbait and non-clickbait headlines.

If you list what's inside the container you will see two text files called clickbait_data and non_clickbait_data. Let's first see the final output that we want to get. For the clickbait data, we want the 20 most common words to be represented like this with their counts:

Image by the Author

And for the 20 most frequent words of non-clickbait headlines:

Image by the Author

Let's see how we can get these histograms through the command line by taking it step by step. After running the docker image, we're now in a new shell with the new environment. Let's first see what the clickbait_data file has by getting the first 10 lines of it:

$ head clickbait_data

So it seems this file has headlines that are labeled as clickbait, as you can see:

Should I Get Bings

Which TV Female Friend Group Do You Belong In

The New "Star Wars: The Force Awakens" Trailer Is Here To Give You Chills

This Vine Of New York On "Celebrity Big Brother" Is Fucking Perfect

A Couple Did A Stunning Photo Shoot With Their Baby After Learning She Had An Inoperable Brain Tumor

And if you use head to get the first lines of non_clickbait_data you will find:

Bill Changing Credit Card Rules Is Sent to Obama With Gun Measure Included
In Hollywood, the Easy-Money Generation Toughens Up
1700 runners still unaccounted for in UK's Lake District following flood

Yankees Pitchers Trade Fielding Drills for Putting Practice
Large earthquake rattles Indonesia; Seventh in two days

Coldplay's new album hits stores worldwide this week

U.N. Chief Presses Sri Lanka on Speeding Relief to War Refugees in Camps

We're interested in words here, not phrases, so we can get the words that have 3 letters or more with:

$ head clickbait_data | grep -oE '\w{3,}'

head clickbait_data is used here because, for now, we're only doing statistics on the few headlines at the top of the file; its output is piped to the next command, grep -oE '\w{3,}'


   -oE: -o for getting only the matching words and -E for using an extended regular expression, which is the pattern that follows

    '\w{3,}': this pattern is like '\w\w\w+', which matches whole words with 3 letters or more
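As a small sketch of what this pattern keeps and drops, here is the same grep run on two made-up headlines (they are illustrative, not taken from the dataset). It assumes a grep that supports \w in extended regular expressions, as GNU grep does.

```shell
# Two made-up headlines: the first has only 1- and 2-letter words,
# so nothing from it should survive the '\w{3,}' filter.
words=$(printf '%s\n' 'Is It OK To Go' 'You Will Love This' | grep -oE '\w{3,}')

# Each match is printed on its own line because of -o.
printf '%s\n' "$words"
```

Note that \w also matches digits and underscores, so a token such as 1700 in a headline would be counted as a word too.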

To get the count of each word we first need the unique words, which we can get with the uniq command and its -c option to give us counts. But uniq only collapses adjacent duplicate lines, so you need to sort first:

$ head clickbait_data | grep -oE '\w{3,}' | sort | uniq -c

This command runs on the first 10 lines; let's run it across all the clickbait headlines:

$ cat clickbait_data | grep -oE '\w{3,}' | sort | uniq -c | sort -nr | head
  • cat clickbait_data | grep -oE '\w{3,}' | sort | uniq -c: we're now feeding the output of this command (the counts of all the words across the clickbait data) into the standard input of the next command
  • sort -nr sorts numerically in reverse order to get the highest count first
  • head gets the first 10 lines, which are the 10 most frequent words

Here is the output of the previous command:

   5538 You
   4983 The
   2538 Your
   1945 That
   1942 Are
   1812 This
   1645 And
   1396 For
   1326 What
   1242 Will
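To see why the sort before uniq -c matters, here is a minimal sketch on a made-up three-word list (not the real data). Without sorting, the two occurrences of "you" are not adjacent, so uniq counts them separately.

```shell
# Made-up input where the duplicate "you" lines are not adjacent.
unsorted=$(printf '%s\n' you the you | uniq -c)

# Sorting first groups the duplicates; sort -nr then puts the
# highest count on top.
sorted=$(printf '%s\n' you the you | sort | uniq -c | sort -nr)

printf 'without sort:\n%s\nwith sort:\n%s\n' "$unsorted" "$sorted"
```

The first result has three lines with a count of 1 each, while the second has "you" counted twice and listed first.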

We're close to good shape with the clickbait counts now; let's see what we can do to clean them up further.

If we take a deeper look

Photograph by David Travis on Unsplash

We can see we're missing the lowercase and all-caps forms of words. For example, for the word 'You' we're missing 'you' and we're also missing 'YOU'. Let's check whether these forms exist in the data:

$ cat clickbait_data | grep -oE '\w{3,}' | sort | uniq -c | sort -nr | grep you
$ cat clickbait_data | grep -oE '\w{3,}' | sort | uniq -c | sort -nr | grep YOU

So we can see that we're missing 2 occurrences for each of 'You' and 'Your'; adding them brings the counts to 5540 and 2540 respectively.

What we need to do first is convert every capital letter into a lowercase one using tr, which is a command-line utility that translates characters:

$ cat clickbait_data | tr '[:upper:]' '[:lower:]' | grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | head

tr '[:upper:]' '[:lower:]' here translates the contents of clickbait_data into lowercase. [:upper:] is a character class that represents all uppercase characters and [:lower:] is a character class that represents all lowercase characters.
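A quick sketch of the effect, using a made-up line containing the three case variants discussed above: after tr, they all collapse into one word and are counted together.

```shell
# Made-up line with three case variants of the same word.
lowered=$(printf '%s\n' 'You YOU you Your' | tr '[:upper:]' '[:lower:]')
printf '%s\n' "$lowered"

# Run through the same counting pipeline; awk is used here only to
# strip the padding that uniq -c puts in front of the count.
merged=$(printf '%s\n' 'You YOU you' | tr '[:upper:]' '[:lower:]' \
  | grep -oE '\w{3,}' | sort | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$merged"
```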

To prepend a header to these values, we can use sed to put in two column names, one for each column:

$ cat clickbait_data | tr '[:upper:]' '[:lower:]' | grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | sed '1i count,word' | head

sed '1i count,word' adds the header: count represents the number of occurrences and word obviously represents the word

1i is used here to insert these two words before the first line. Since sed is reading from the pipe here, the header is added to the output stream only and no file is changed in place.


count,word
5540 you
4992 the
2540 your
1950 that
1944 are
1812 this
1653 and
1397 for
1326 what
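The one-line form sed '1i count,word' is a GNU sed extension and may not work with other sed implementations (BSD/macOS sed, for instance, expects a different syntax). As a hedged alternative, not the approach used in this tutorial, the header can be prepended with a plain command group; the two counts below are made up for illustration.

```shell
# Made-up counts in the "count word" layout produced by uniq -c.
# The group prints the header first, then cat passes the piped
# lines through unchanged.
with_header=$(printf '%s\n' '3 you' '2 the' | { echo 'count,word'; cat; })
printf '%s\n' "$with_header"
```

Either way, only the stream is changed; the input file itself is left untouched.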

To print the counts in a prettier shape we can use csvlook, which gives us this:

| count        | word |
| ------------ | ---- |
|    5540 you  |      |
|    4992 the  |      |
|    2540 your |      |
|    1950 that |      |
|    1944 are  |      |
|    1812 this |      |
|    1653 and  |      |
|    1397 for  |      |
|    1326 what |      |

which is not pretty at all. The reason this happened is that csvlook, as its name indicates, gives a better look to a CSV file, so we should have a CSV (Comma Separated Values) file first. We then need a way to separate the values on each line with a comma. At this point, we can use awk, which is a pattern-directed scanning and processing language:

$ cat clickbait_data | tr '[:upper:]' '[:lower:]' | grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | awk '{print $1","$2}' | sed '1i count,word' | head | csvlook

awk '{print $1","$2}'


   print $1 here prints the first field (which is the count column) followed by…

          "," a comma followed by…

          $2 the second field which is the word column
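Here is a minimal sketch of that awk step on two lines copied from the output above, including the leading padding that uniq -c adds. The second variant, which sets awk's output field separator (OFS), is an equivalent alternative shown only for comparison.

```shell
# awk splits on runs of whitespace by default, so the leading
# padding from uniq -c is ignored and $1 is the count.
csv=$(printf '%s\n' '   5540 you' '   4992 the' | awk '{print $1","$2}')
printf '%s\n' "$csv"

# Same result by setting the output field separator to a comma.
csv_ofs=$(printf '%s\n' '   5540 you' '   4992 the' | awk -v OFS=, '{print $1, $2}')
printf '%s\n' "$csv_ofs"
```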


With the comma in place, csvlook gives us a much better shape:

| count | word |
| ----- | ---- |
| 5,540 | you  |
| 4,992 | the  |
| 2,540 | your |
| 1,950 | that |
| 1,944 | are  |
| 1,812 | this |
| 1,653 | and  |
| 1,397 | for  |
| 1,326 | what |

If we want the word column in the first field and the count column in the second field, we just need to reverse the order in the awk and sed commands:

$ cat clickbait_data | tr '[:upper:]' '[:lower:]' | grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | awk '{print $2","$1}' | sed '1i word,count' | head | csvlook

To get the same output for the non-clickbait data, we just need to change the filename:

$ cat non_clickbait_data | tr '[:upper:]' '[:lower:]' | grep -oE '\w{3,}' \
| sort | uniq -c | sort -nr | awk '{print $2","$1}' | sed '1i word,count' | head | csvlook
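Since the two commands differ only in the filename, the pipeline could be wrapped in a small shell function. The sketch below is one way to do that, not something from the original tutorial: the name top_words and its arguments are hypothetical, the header is printed with printf instead of sed, and csvlook is left out so the example only needs standard tools. It is demonstrated on a temporary file of made-up headlines.

```shell
# top_words FILE [N]: hypothetical helper that prints the N most
# frequent words (3+ letters, lowercased) of FILE as word,count CSV.
top_words() {
  printf 'word,count\n'
  tr '[:upper:]' '[:lower:]' < "$1" |
    grep -oE '\w{3,}' |
    sort | uniq -c | sort -nr |
    awk '{print $2","$1}' |
    head -n "${2:-10}"
}

# Demo on made-up headlines written to a temporary file.
demo=$(mktemp)
printf '%s\n' 'You Will Love This' 'This Is For You' 'You Can' > "$demo"
result=$(top_words "$demo" 2)
printf '%s\n' "$result"
rm -f "$demo"
```

Inside the container, this would be called as top_words clickbait_data 20 or top_words non_clickbait_data 20, and the result could still be piped to csvlook.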

Getting insight into the clickbait study

The study reported in the paper addresses clickbait and non-clickbait headlines in order to detect both. As the paper puts it:

"Most of the online news media outlets rely heavily on the revenues generated from the clicks made by their readers, and due to the presence of numerous such outlets, they need to compete with each other for reader attention."

So what's in this tutorial is a way to clean up data through the command line so that we can get some insights into the results of this paper and see if we can reproduce some of the points claimed by the paper through their research.

Again, let's see the final distribution of the 20 most frequent words for clickbait headlines:

Image by the Author

We can obviously see the heavy use of the second-person you rather than third-person references like he, she, or a specific name, and also of the possessive your, as it might appear in phrases commonly used in clickbait, like 'Will Blow Your Mind'. Also, we can find common determiners like this, that, and which, which make the reader inquisitive about the object being referenced and persuade them to pursue the article further.

On the other hand, the distribution for the 20 most frequent words for non-clickbait headlines is:

Image by the Author

We can see here non-possessive words like australian, president, and obama, along with some other words that might occur in both.

Final thoughts

The clickbait paper suggests much deeper investigations than we did here, but we could still get some valuable insights with just one line of code at the command line. We learned how to use tr to translate characters, grep to filter out words of 3 letters or more, sort and uniq to get a histogram of word occurrences, awk to print fields in our desired positions, and sed to put a header on the data we're processing.

Thanks for reading this far!
