Whether you are a newbie or a professional in natural language processing or data mining, text pre-processing is a task that you will need to go through at some point. Although there is a plethora of libraries which can pre-process your text perfectly well, few of them offer the power, simplicity and flexibility that command-line programs provide. In this tutorial, a few simple but essential programs for text pre-processing in the command-line are introduced.
What we mean by text pre-processing is the normalisation of a text by removing special characters (e.g. punctuation marks, non-Unicode characters, etc.) or tokens, or changing the whole structure of the text for a specific purpose (e.g. adding <eos> to the end of sentences for training neural networks). In other words, whatever is required to prepare a text file for processing is a part of text pre-processing. Later, we will discuss text processing tasks such as counting unique words in a document, sorting words by frequency, removing stop words and retrieving specific text structures.
Obviously, for one problem there may be more than one solution. However, in this tutorial, only a few techniques and programs are introduced. It will be up to you, your preferences and the requirements of your task to decide which command fits your problem best. Personally, I love the command-line and how each program can be used as a piece of a puzzle, thanks to pipes, to solve sophisticated problems, particularly where processing time and performance matter.
System setup: The commands in this tutorial have been tested on GNU bash, version 5.0.2(1)-release (x86_64-apple-darwin18.2.0). You will normally have no problem running them on any Unix-based operating system. If you are a Windows user, the best solution is to start using a Linux-based open-source operating system like Ubuntu (seriously, do it!). Otherwise, you can simply install Cygwin, which lets you enjoy GNU in your Windows, somehow!
If you are not familiar with the command-line, particularly Bash–the command language interpreter for the GNU operating system–you should learn its basic concepts, features and commands. This reference manual may be useful. Recall that you may use the man command to display the user manual of any command.
If you are interested in executing the commands in practice, you can download this text file into your working directory and see how each command works. Our sample text is very noisy; it contains HTML tags, indented paragraphs, and multiple unnecessary newlines, spaces and tokens. Our goal is to clean this sample text by getting rid of the noisy stuff.
The very first step of any task in text (pre-)processing is reading the text. We will see later in this tutorial that some commands, like wc, take care of the text reading without requiring any further commands from us.
cat
cat [options] [filenames]
The cat command (short for “concatenate”) is one of the most frequently used commands; it not only allows us to read files, but also to create, display and concatenate them.
The following command prints out the whole file content in your command-line:
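Assuming the sample file is saved as sample_text.txt (the name is just an assumption; use whatever you named your file):

cat sample_text.txt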
You may use the less command, which lets you scroll through the file without having all of its content displayed on the command-line at once. Don’t forget to use pipes, i.e. |, whenever you want to use the output of one command as the input of another.
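For instance:

cat sample_text.txt | less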
To exit the viewing window, press q.
You may also write your file content to a new file like this:
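For instance, with an illustrative target file name:

cat sample_text.txt > sample_copy.txt
cat sample_text.txt >> sample_copy.txt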
The difference between > and >> is that the former overwrites the file if it already exists, while the latter keeps the existing content untouched and appends the new content to the file.
cat can also turn your command-line into a basic text editor as follows:
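For example (the file name and the typed text are, of course, just an illustration):

cat > quick_note.txt
This is a quick note
written in the command-line.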
By using cat > quick_note.txt, you are creating a new file and allowing cat to write whatever you type in the command-line as the content of your file. Once you are done with typing, ctrl-d saves the content and returns the command-line to its normal functionality.
The awesomeness of cat is not limited to these! You can also concatenate multiple files into one as follows:
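For instance, with two hypothetical files results_semester1.txt and results_semester2.txt:

cat results_semester1.txt results_semester2.txt > 2019_results.txt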
This way, the content of those two files is merged into a new file called 2019_results.txt.
head and tail
head [options] [file(s)], tail [options] [file(s)]
Sometimes you only want to take a quick look at what the content of a file looks like. head and tail are two commands which allow us to view a certain number of lines of a file, at the beginning and at the end respectively. There is an interesting option, -n, which lets us specify how many lines should be displayed; the default is 10 lines.
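For instance, to view the first five and the last five lines of our sample file:

head -n 5 sample_text.txt
tail -n 5 sample_text.txt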
head and tail allow further manoeuvres when used in various orders with pipes and different options. Suppose that you have a file containing 1000 lines and you would like to read lines 286 to 300. One way to do so is the following:
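A sketch of the idea, assuming the file is called lines_1000.txt:

head -n 300 lines_1000.txt | tail -n 15

head passes the first 300 lines to tail, which then keeps only the last 15 of them, i.e. lines 286 to 300.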
awk
But what if you only want a specific range of lines in your file? Well, you can use awk as follows:
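awk 'NR >= first_line && NR <= last_line' filename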
where first_line and last_line should be replaced by the line range that you are interested in (NR is awk’s built-in line counter). For instance, to display lines 48 to 51 of our sample file:
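awk 'NR >= 48 && NR <= 51' sample_text.txt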
awk is a very useful command in text (pre-)processing that we will discuss later.
iconv
iconv -f source-encoding-code -t target-encoding-code < inputfile > outputfile
Although nowadays the Unicode standard is the default encoding of most programs and of the web, there are still texts for which older encodings such as ISO 8859 and Windows-1252 were used, and converting them is therefore necessary. For this purpose, there is a program called iconv which converts your text from one encoding to another. The syntax of iconv is shown above; change source-encoding-code and target-encoding-code to the encoding of your source text and your desired output encoding, respectively.
Our sample text file is encoded in UTF-8. The following is an example to convert “crépuscule” from UTF-8 to ISO-8859-15:
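echo "crépuscule" | iconv -f UTF-8 -t ISO-8859-15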
Recall that echo displays a text on standard output.
Any character that iconv does not know how to convert will be replaced with a question mark. This may not be good when conversion accuracy matters. However, when dealing with a large number of files, I personally prefer having a bunch of ?s to prevent any text-reading errors!
tr
tr [options] [set1] [set2]
Replacing or removing specific characters is also a part of text pre-processing. For instance, unnecessary spaces or newlines may not be welcome in some tasks. tr, an abbreviation of translate or transliterate, is a very useful command for dealing with such cases.
Suppose that you have a text in Greek, where ; is normally used the way ? is used in English, and you want to replace ; with ?:
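For example, with an illustrative Greek sentence:

echo 'Τι ώρα είναι;' | tr ';' '?'

which prints Τι ώρα είναι?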
or, replacing spaces by newlines:
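tr ' ' '\n' < sample_text.txt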
Replacing characters can also be done with a sequence of characters as follows:
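echo 'lipsum' | tr 'iplo' 'abnu'

The (arbitrary) input word lipsum comes out as nabsum.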
In this case, tr simply translates each character in the first set “iplo” to its corresponding character in the second set “abnu”.
tr also supports character ranges such as a-z, A-Z and 0-9 to represent the lower-case letters, upper-case letters and digits, respectively:
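For example, to turn an (arbitrary) input into upper case:

echo 'Hello, World!' | tr 'a-z' 'A-Z'

which prints HELLO, WORLD!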
What if we want to replace multiple consecutive occurrences of a character with a single one? For instance, lines 340 to 363 in the sample text file contain multiple spaces that we would like to squeeze into one. This can be done with the -s option of tr:
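For instance, reusing awk to select that range of the sample file:

awk 'NR >= 340 && NR <= 363' sample_text.txt | tr -s ' '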
After squeezing the multiple spaces and newlines this way, the text becomes much more readable.
To delete all the occurrences of a character, you can use tr -d [set], which removes every character in the set instead of translating it:
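For example, to delete all digits from an (arbitrary) input:

echo 'h3ll0 w0rld' | tr -d '0-9'

which prints hll wrld.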
To discover more interesting options, try man tr.
The commands that we have introduced so far were applied directly to the text with simple patterns. What if we need to deal with more sophisticated text patterns? This is where regular expressions (regex) come in!
Regexes allow us to access parts of a text based on rules that we define. In some ways, a regex is analogous to a database query or a semantic query: it is a query to extract information from raw text!
Let’s take a look at our sample text file, where lines 1215 to 1220 contain HTML code (they look like an HTML table). You can view them with the awk technique from earlier:
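awk 'NR >= 1215 && NR <= 1220' sample_text.txt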
Let’s assume that we want to remove the tags, i.e. anything between < and >, and only keep whatever is enclosed by the tags. To do so, we create a regular expression which looks for anything starting with a < and proceeding until it gets to a >; the regex matches anything between those two characters, except a >. This regex is /<[^>]*>/.
For this problem, we use Perl, a programming language frequently used in text processing thanks to its powerful facilities. It can also be used in the command-line just like the other commands:
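perl -0777 -pe 's/<[^>]*>//gs' sample_text.txt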
There are a few details that we should know:

- The -0777 option lets us slurp the file and feed all the lines to Perl in one go.
- -pe is composed of two flags: e, which allows us to specify the Perl code to be run, and -p, which processes the file line by line.
- s/[set1]/[set2]/ is the substitution operator, which substitutes [set1] with [set2].
- /g is the global flag, which applies the matching over the whole text.
- /s treats the string as a single long line (the only difference is that \n is also included when . is used).

Another example with Perl: the following regular expression replaces any run of more than two newlines with exactly two newlines. This could also be done with the commands we introduced previously, like tr:
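A possible way to write this (the exact threshold can be adapted to your needs):

perl -0777 -pe 's/\n{3,}/\n\n/g' sample_text.txt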
Here you can learn more about the fascinating functionalities of regular expressions in Perl. Most text editors make it easy to test, debug and visualise regular expressions. There are also many websites for the same purpose, such as https://regex101.com/.
split
split [options] filename prefix
We previously mentioned how to merge files using cat. What if we want to split a huge text file into smaller parts? This can be done with split.
split allows us to split a file based on the number of lines with the -l option, or based on size with the -b option. Let’s split our sample text file into smaller files containing 500 lines each:
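split -l 500 sample_text.txt part_

This produces files named part_aa, part_ab and so on, each containing 500 lines (except possibly the last one); the prefix part_ is our own choice.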
How your text should be pre-processed depends on the content of your file and the purpose of your pre-processing. You should know what the noisy parts are and which issues should be cleaned or normalised.
Suppose that we need the following steps to clean our sample text file:
- Removing HTML tags (perl -0777 -pe 's/<[^>]*>//gs')
- Removing tokens that contain non-alphanumeric characters (perl -0777 -pe 's/^[\S]*[^\w\s][\S]*$//gmi')
- Removing unwanted lines, e.g. those containing javascript:void(0);, and characters (awk '{gsub(/^.*javascript.*$/,"\n"); print }' or awk '{gsub(/ /," "); print }')

Following the techniques introduced earlier in this tutorial, we clean the sample text as follows:
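One possible way to chain the steps above into a single pipeline (a sketch; the exact order and patterns may need tuning for your own file, and the output file name is our choice):

perl -0777 -pe 's/<[^>]*>//gs' sample_text.txt \
| perl -0777 -pe 's/^[\S]*[^\w\s][\S]*$//gmi' \
| awk '{gsub(/^.*javascript.*$/,"\n"); print }' \
| tr -s ' \n' > sample_text_cleaned.txt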
This is how this long command cleans our sample text file like a charm!
Last updated on 28 April 2019.