Monday, July 4, 2016

Converting a PDF to CSV using a Linux shell script

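A Linux shell script to extract data from PDFs and create a CSV. The regular expressions for sed are rather different from the Perl-like ones I am used to in Java: sed defaults to basic regular expressions, so \d is not available ([0-9] is used instead), + and grouping parentheses need to be escaped as \+ and \( \), etc.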
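As a small illustration of that difference, the sketch below pulls a run of digits out of a line in two ways: once with sed's default basic syntax, and once with the -E extended syntax, where the escaping is reversed. It assumes GNU sed (\+ in basic mode is a GNU extension), and the sample text is invented for the example.

```shell
#!/bin/sh
# Invented sample line; not taken from a real PDF.
SAMPLE='Total (Rs :) 1500 paid'

# Basic regular expressions (sed default): ( ) are literal, \( \) is a group,
# and \+ means "one or more" (GNU extension).
BRE_AMOUNT=$(printf '%s\n' "$SAMPLE" | sed -e 's/.*(Rs :) \([0-9]\+\).*/\1/')

# Extended regular expressions (-E): ( ) is a group and + is "one or more",
# so the literal parentheses are the ones that need escaping.
ERE_AMOUNT=$(printf '%s\n' "$SAMPLE" | sed -E -e 's/.*\(Rs :\) ([0-9]+).*/\1/')

echo "BRE: $BRE_AMOUNT"
echo "ERE: $ERE_AMOUNT"
```

Both forms capture the same digits. Because the PDF text is full of literal parentheses, the basic syntax arguably needs less escaping here.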

Below, we iterate through the PDFs, use pdftk to get an uncompressed version in which the text is readable, use strings to extract the string data, use grep to keep only the lines that end in the Tj (show text) operator, use tr to replace newlines with spaces, apply sed to extract the particular fields that we want, assign those to variables, and echo the variables to a CSV file.

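# Start with a fresh output file; -f avoids an error if it does not exist yet.
rm -f pdf.csv
for FILE in *.pdf
do
  echo "$FILE"
  # Uncompress the PDF streams, keep the Tj (show text) lines, and join them into one line.
  pdftk "$FILE" output - uncompress | strings | grep -F ')Tj' | tr '\n' ' ' | sed -e 's/)Tj /) /g' > temptocsv.txt
  # Each sed call keeps only the group captured between \( and \).
  AMOUNT=$(sed -e 's/.*(Rs :) \([0-9]\+\).*/\1/' temptocsv.txt)
  CHLDATE=$(sed -e 's/.*(Date of) (challan :) (\([^)]\+\)).*/\1/' temptocsv.txt)
  SBIREFNO=$(sed -e 's/.*(SBI Ref No\. : ) (\([^)]\+\)).*/\1/' temptocsv.txt)
  CHLNO=$(sed -e 's/.*(Challan) (No) (CIN) \(.*\) (Date of).*/\1/' temptocsv.txt)
  # Quote the line so spaces inside a field are preserved in the CSV.
  echo "$FILE,$CHLDATE,$SBIREFNO,$CHLNO,$AMOUNT" >> pdf.csv
done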
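
One thing to watch for: a sed substitution that does not match prints its input unchanged, so a PDF that lacks a field would put the whole extracted text into that CSV column. A possible guard is to run sed with -n and the p flag, so that nothing is printed unless the substitution succeeded. A minimal sketch of the difference, again with invented sample text and assuming GNU sed:

```shell
#!/bin/sh
# Invented sample text that has no amount field in it.
SAMPLE='(Challan) (No) (CIN) (Date of) (challan :) (04/07/2016)'

# Plain substitution: the pattern does not match, so sed prints the line unchanged.
PLAIN=$(printf '%s\n' "$SAMPLE" | sed -e 's/.*(Rs :) \([0-9]\+\).*/\1/')

# -n suppresses the default output; the p flag prints only after a successful substitution.
STRICT=$(printf '%s\n' "$SAMPLE" | sed -n -e 's/.*(Rs :) \([0-9]\+\).*/\1/p')

echo "plain:  [$PLAIN]"
echo "strict: [$STRICT]"
```

With the strict form a missing field shows up as an empty column, which is easier to spot when checking the CSV afterwards.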