Refactoring bash scripts

Checking file permissions in terminal

$ ls -l
drwxrwxr-x 3 vagrant vagrant 4096 Jun  7 00:58 R
-rw-rw-r-- 1 vagrant vagrant    7 Jun  8 20:22
-rw-rw-r-- 1 vagrant vagrant    7 Jun  8 20:22
# -rw-rw-r--
# The first character (-) indicates that it is a regular file (d would indicate a directory).
# The next three characters (rw-) are the access permissions of the user who owns the file (r - read, w - write, x - execute).
# The next three characters (rw-) are the access permissions of the members of the group that owns the file.
# The last three characters (r--) are the access permissions of all other users.

# changing file permission
$ chmod u+x
# u indicates that we want to change the permissions for the user who owns the file; + indicates that we want to add a permission; x indicates the execute permission
$ ls -l
drwxrwxr-x 3 vagrant vagrant 4096 Jun  7 00:58 R
-rw-rw-r-- 1 vagrant vagrant    7 Jun  8 20:22
-rwxrw-r-- 1 vagrant vagrant    7 Jun  8 20:22
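
The same steps can be tried on a throwaway file. This is only a sketch: the scratch file created by mktemp and the starting mode 644 are assumptions, not files from the listing above.

```shell
# Create a scratch file and give it a known starting mode (rw-r--r--).
demo_file="$(mktemp)"
chmod 644 "$demo_file"
before="$(ls -l "$demo_file" | cut -c1-10)"

# Add the execute permission for the owning user only.
chmod u+x "$demo_file"
after="$(ls -l "$demo_file" | cut -c1-10)"

echo "$before -> $after"   # -rw-r--r-- -> -rwxr--r--
rm -f "$demo_file"
```

Only the fourth character changes, because u+x touches the owner's permissions and leaves the group and other users as they were.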

Adding shebang to scripts – The shebang is a special first line in a script that tells the system which executable should be used to interpret the commands.

# if we want to use bash to interpret the commands, the following line should be the first line of the script,
#!/usr/bin/env bash
# if we want to use python to interpret the commands, we should use the following line instead,
#!/usr/bin/env python

Making bash scripts more reusable.

#!/usr/bin/env bash
curl -s |
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
uniq -c | sort -nr | head -n 10

# In the script above, the first part (curl) hard-codes the input, and -n 10 hard-codes the number of words. If the script reads its input from standard input and takes the number of words as its first argument, it can be refactored as follows,
#!/usr/bin/env bash
NUM_WORDS="$1"
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
uniq -c | sort -nr | head -n "$NUM_WORDS"

# the refactored script is run as follows,
cat data/ | ./ 10 
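
To see the refactored script work end to end, here is a small sketch. The scratch script file, the sample sentence, and the default of 10 words are all assumptions made for illustration; the original file names are not known.

```shell
# Write the refactored script to a scratch file (hypothetical location).
script="$(mktemp)"
cat > "$script" <<'EOF'
#!/usr/bin/env bash
# First argument: number of words to show (default of 10 is an assumption).
NUM_WORDS="${1:-10}"
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
uniq -c | sort -nr | head -n "$NUM_WORDS"
EOF

# Feed text on standard input and ask for the top 2 words.
top_words="$(printf 'The cat and the dog and the bird\n' | bash "$script" 2)"
echo "$top_words"   # "the" (3 times) first, then "and" (2 times)
rm -f "$script"
```

Because the input now arrives on standard input, the same script works with cat, curl, or any other command that writes text.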

Making sure we can run a bash script from anywhere – To do this, the directory that contains the script must be added to the PATH.

# how to list the directories that are currently on the PATH
$ echo $PATH | tr : '\n' | sort

#To change the PATH permanently, you’ll need to edit the .bashrc or .profile file located in your home directory. If you put all your custom command-line tools into one directory, say, ~/tools, then you’ll only need to change the PATH once.
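
As a sketch of what such an edit might look like, the lines below add ~/tools (the example directory from the note above) to the PATH of the current shell. The variable name tools_dir is an assumption.

```shell
# ~/tools is the example directory named in the text; it does not have
# to exist for the PATH entry to be added.
tools_dir="$HOME/tools"

# Append the directory only if it is not already on the PATH.
case ":$PATH:" in
  *":$tools_dir:"*) ;;
  *) PATH="$PATH:$tools_dir" ;;
esac
export PATH

# Confirm that the entry is present (prints the matching line).
echo "$PATH" | tr ':' '\n' | grep -Fx "$tools_dir"
```

Placing the same lines in .bashrc or .profile applies the change to every new shell, and the case guard keeps the entry from being added twice.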

Processing data from bash

Compressing Files

#compressing data/tmp.txt to temp.tar.gz
$ tar -zcvf temp.tar.gz data/tmp.txt
#-z - compress archive using gzip algorithm
#-c - Create archive
#-v - verbose, display progress while creating archive
#-f - archive file name

Decompressing Files

# decompress the file in the same directory
$ tar -zxvf temp.tar.gz
# -x - extract files

# decompress the file in a particular directory
$ tar -zxvf temp.tar.gz -C /tmp
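
The compress and decompress steps can be checked together in a scratch directory. In this sketch the directory layout and the file contents are assumptions; only the tar options come from the notes above.

```shell
# Build a small scratch tree that mirrors data/tmp.txt.
work="$(mktemp -d)"
mkdir -p "$work/data" "$work/out"
printf 'hello\n' > "$work/data/tmp.txt"

# Create the gzip-compressed archive (-C switches directory first, so the
# archive stores the relative path data/tmp.txt).
tar -zcf "$work/temp.tar.gz" -C "$work" data/tmp.txt

# List the archive contents without extracting (-t).
tar -ztf "$work/temp.tar.gz"

# Extract into a particular directory and read the file back.
tar -zxf "$work/temp.tar.gz" -C "$work/out"
restored="$(cat "$work/out/data/tmp.txt")"
echo "$restored"   # hello

rm -rf "$work"
```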

Processing CSV files in bash – csvkit is used to process the csv files.

# installing csvkit on ubuntu
$ sudo pip install csvkit

# getting data from web
$ wget

# reading xlsx file
$ in2csv data-science-cmd-line/book/ch03/data/imdb-250.xlsx | head -n 3

# in2csv converts the .xlsx file to CSV; csvcut then selects the columns we want, and csvlook renders the result as a readable table
$ in2csv data-science-cmd-line/book/ch03/data/imdb-250.xlsx | head -n 3 | csvcut -c Title,Year,Rating | csvlook

Querying relational databases from bash – If the data is stored in a SQL database, the sql2csv command can be used to query it. sql2csv supports SELECT, INSERT, UPDATE, and DELETE queries.

# how to select specific data from a SQL database
$ sql2csv --db 'sqlite:///data-science-cmd-line/book/ch03/data/iris.db' --query 'SELECT * FROM iris WHERE sepal_length > 5.5'
# the --db option takes the URL of the SQL database

Reading data from a web API – Web APIs usually return data in a structured format, such as JSON or XML, which is easily processed by other tools, such as jq.

curl -s | jq '.'

Familiarizing with command-line tools

pwd – this command prints the name of the current directory.

$ pwd

ls – command ls is used to view the contents of a directory.

$ ls
data-science  R  repos

cd – command cd is used to navigate to different directories

$ cd R/x86_64-pc-linux-gnu-library/
$ cd ..

head – command head is used to view the first few lines of a file. Values that come after the command are called command-line arguments or options.

$ head -n 5 data-science/data-science-at-the-command-line-master/book/ch02/data/movies.txt
Star Wars
Home Alone
Indiana Jones
Back to the Future

grep – it is used to filter lines.
wc – wc is used to count lines.
sort – it is used to sort lines.
seq – seq generates a sequence of lines.
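
A short sketch of these four tools on generated input. The range 1 to 12 and the variable names are chosen only for illustration.

```shell
# seq generates the numbers 1 to 12, one per line; wc -l counts the lines.
line_count="$(seq 12 | wc -l)"

# grep keeps only the lines that contain the character 1.
matches="$(seq 12 | grep 1)"

# sort -nr sorts numerically in reverse, so the largest number comes first.
largest="$(seq 12 | sort -nr | head -n 1)"

echo "$matches"   # 1, 10, 11 and 12, one per line
echo "$largest"   # 12
```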

A backslash (\) is used to break up a long command across multiple lines.

$ echo 'ahilan'\
> 'kana'

(>) is the continuation prompt, which indicates that this line is a continuation of the previous one.

Shell function – A shell function is a named group of commands that is executed by the shell itself (here, bash).

$ fac() { (echo 1; seq $1) | paste -s -d\* | bc; }
$ fac 6
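
The fac one-liner depends on bc being installed. As a sketch of an alternative, the hypothetical function fac_loop below computes the same factorial with shell arithmetic only; it is not part of the original notes.

```shell
# Hypothetical variant of fac that does not need paste or bc.
# The body is wrapped in ( ... ), so it runs in a subshell and the
# variables result and i do not leak into the calling shell.
fac_loop() (
  result=1
  for i in $(seq "$1"); do
    result=$((result * i))
  done
  echo "$result"
)

fac_loop 6   # prints 720
```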

Alias – An alias is a shortcut for a longer command. Aliases are often defined in the .bashrc or .bash_aliases configuration files. The currently defined aliases can be listed by running the alias command.
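
A minimal sketch of defining, inspecting, and removing an alias. The alias name lsl is hypothetical.

```shell
# Define a hypothetical alias named lsl, then print its definition.
alias lsl='ls -l'
alias_def="$(alias lsl)"
echo "$alias_def"

# Remove it again so the session is left unchanged.
unalias lsl
```

In scripts, bash does not expand aliases unless the expand_aliases shell option is set, which is one reason aliases usually live in interactive configuration files.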

Identifying the type of command line tool

$ type -a pwd
pwd is a shell builtin
pwd is /bin/pwd

$ type -a fac
fac is a function
fac ()
    ( echo 1;
    seq $1 ) | paste -s -d\* | bc

Combining command-line tools – command-line tools are combined through the pipe (|).

# seq generates a sequence of numbers
$ seq 20

# we can pipe the output of seq to a second command-line tool (grep) to filter lines.
$ seq 20 | grep 4

# to find out how many numbers between 1 and 100 contain a “5”,
$ seq 100 | grep 5 | wc -l
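
As a check on the pipeline above, this sketch captures the count in a variable and compares it with grep -c, which counts matching lines itself. The variable names are assumptions.

```shell
# Count the numbers from 1 to 100 that contain the digit 5.
count_pipeline="$(seq 100 | grep 5 | wc -l)"   # 19

# grep -c counts the matching lines directly, so wc is not needed.
count_grep="$(seq 100 | grep -c 5)"            # 19

echo "$count_pipeline $count_grep"
```

The count is 19 because ten numbers end in 5, ten lie between 50 and 59, and 55 belongs to both groups.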

Save the output of command-line tools

# if numbers.txt already exists, the following command will overwrite it (otherwise the file is created)
seq 100 > numbers.txt
# if numbers.txt already exists, the following command will append its output to the existing contents
seq 20 >> numbers.txt
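
The difference between > and >> shows up in the line counts. This sketch uses a scratch file from mktemp in place of numbers.txt, which is an assumption.

```shell
out_file="$(mktemp)"      # scratch file standing in for numbers.txt

seq 100 > "$out_file"     # overwrite: the file now has 100 lines
seq 20 >> "$out_file"     # append: the file now has 120 lines
appended_lines="$(wc -l < "$out_file")"

seq 5 > "$out_file"       # overwrite again: only 5 lines remain
overwritten_lines="$(wc -l < "$out_file")"

echo "$appended_lines $overwritten_lines"
rm -f "$out_file"
```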

mv – is used to move or rename files and directories.
rm – is used to remove files. Option -r is used to delete directories and their contents recursively.
cp – is used to copy files.
mkdir – is used to create directories.
man – is used to read the manual page (help) of a command-line tool.
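
A short sketch that exercises mkdir, cp, mv, and rm inside a scratch directory. All of the file and directory names are hypothetical.

```shell
work="$(mktemp -d)"                              # scratch directory
mkdir "$work/project"                            # create a directory
printf 'draft\n' > "$work/project/a.txt"

cp "$work/project/a.txt" "$work/project/b.txt"   # copy a file
mv "$work/project/b.txt" "$work/project/c.txt"   # rename the copy
rm "$work/project/a.txt"                         # remove a single file

remaining="$(ls "$work/project")"
echo "$remaining"                                # c.txt

rm -r "$work"                                    # remove the whole tree
```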