Missing data can be very annoying to a programmer. In fact, it is so annoying that we often write separate programs to clean up the data and eliminate unpleasant conditions so that the main program doesn't have to deal with them. Here, I'll show some examples of the kinds of problems we see.
Let's take a common data format, a tab-delimited file. A simplistic Perl program to read such a file might be:
#!/usr/bin/perl
while (<>) {
    # split on tab into @x array
    @x = split /\t/;
    # print first three elements
    print "$x[0]\t$x[1]\t$x[2]\n";
}
An equivalent shell script might be
IFS="(tab here)"
while read a b c d
do
    echo "$a $b $c"
done
The Perl script works, but the shell script doesn't. Suppose the input file looks like this:
$ cat t;hexdump -c t
1	2	3	4
1		3	4
	2	3	4
1	2		4
		3	4
0000000   1  \t   2  \t   3  \t   4  \n   1  \t  \t   3  \t   4  \n  \t
0000010   2  \t   3  \t   4  \n   1  \t   2  \t  \t   4  \n  \t  \t   3
0000020  \t   4  \n
0000023
The Perl script produces
1	2	3
1		3
	2	3
1	2	
		3
but the shell script messes up:
1 2 3
1 3 4
2 3 4
1 2 4
3 4
If this were a problem with Perl, we'd handle it like this:
#!/usr/bin/perl
while (<>) {
    # make sure there is at least one space between adjacent tabs;
    # the lookahead also catches runs of three or more tabs
    s/\t(?=\t)/\t /g;
    # split on tab into @x array
    @x = split /\t/;
    # print first three elements
    print "$x[0]\t$x[1]\t$x[2]\n";
}
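The shell version fails because a tab counts as IFS whitespace, and read treats a run of IFS whitespace as a single separator (and strips it from the start of the line). One possible workaround, sketched here, is to translate the tabs to a non-whitespace character before reading. The function name first_three and the choice of | as the stand-in delimiter are assumptions for illustration, and the sketch assumes the data itself never contains a |.

```shell
#!/bin/sh
# Print the first three tab-separated fields of each input line,
# keeping empty fields in their positions.
# Assumption: '|' never appears in the data, so it is safe as a stand-in.
first_three() {
    tr '\t' '|' |
    while IFS='|' read -r a b c rest
    do
        # '|' is not whitespace, so adjacent delimiters yield empty fields
        printf '%s\t%s\t%s\n' "$a" "$b" "$c"
    done
}

# Usage: first_three < t
```

Run against the sample file t, this should keep the empty columns where they belong instead of sliding later fields to the left.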
But things can be worse. For example, if we are processing what was once a report format, we may have no delimiters, just empty space. We might see something like this:
Date          Customer             Phone           Terms         Balance

09/04/04      ABCD Corp.                           PPD              0.00
09/04/04      Abba Corp.          555-5555         Net 30         985.00

You can't process that with delimiters, but you can use unpack:
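#!/usr/bin/perl
while (<>) {
    # fixed-width fields: date, gap, customer, phone, terms, balance
    @x = unpack("A8A6A20A17A9A12", $_);
    # skip $x[1], the gap after the date
    print "$x[0]:$x[2]:$x[3]:$x[4]:$x[5]\n";
}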
Which will produce:
Date:Customer: Phone:Terms:     Balance
::::
09/04/04:ABCD Corp.::PPD:        0.00
09/04/04:Abba Corp.:555-5555:Net 30:      985.00
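If Perl isn't available, the same fixed-column slicing can be approximated in shell with cut. This is only a sketch: the helper names field and slice_report are made up, and the column positions are simply the A8 A6 A20 A17 A9 A12 widths converted to 1-based ranges. It starts three processes per field, so it is far slower than unpack on large files.

```shell
#!/bin/sh
# field LINE START END: print columns START..END (1-based) of LINE
# with trailing whitespace removed, roughly what unpack's "A" does.
field() {
    printf '%s\n' "$1" | cut -c"$2-$3" | sed 's/[[:space:]]*$//'
}

# Read a fixed-width report on stdin and print colon-separated fields.
# Column ranges are an assumption derived from the widths 8,6,20,17,9,12;
# columns 9-14 (the gap after the date) are skipped.
slice_report() {
    while IFS= read -r line
    do
        printf '%s:%s:%s:%s:%s\n' \
            "$(field "$line" 1 8)" \
            "$(field "$line" 15 34)" \
            "$(field "$line" 35 51)" \
            "$(field "$line" 52 60)" \
            "$(field "$line" 61 72)"
    done
}

# Usage: slice_report < report.txt
```

As with unpack, a missing phone number simply comes out as an empty field, because the position, not a delimiter, decides where each field lives.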
Comma-separated value files can be annoying if they also contain commas within quoted fields. You can't use a plain split because of that. There are at least two ways to handle it. Either use the Text::ParseWords module:
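#!/usr/bin/perl
use Text::ParseWords;
while (<>) {
    # drop the newline so it doesn't end up in the last field
    chomp;
    # split on commas, honoring quotes; 0 means discard the quote characters
    @x = quotewords(",", 0, $_);
    foreach (@x) {
        print " $_";
    }
    print "\n";
}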
Or (assuming the data is regular enough), replace commas not inside quotes with a different delimiter and then split it. I think ParseWords is easier.
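The second approach (swap the unquoted commas for another delimiter, then split) might look something like this in shell. It is a rough sketch under stated assumptions: requote is a made-up name, | is an assumed replacement delimiter that must not occur in the data, and escaped quotes inside a field are not handled.

```shell
#!/bin/sh
# requote LINE: print LINE with commas outside double quotes changed to '|'
# and the quote characters removed.
# Assumptions: no escaped quotes ("" or \") and no '|' in the data.
requote() {
    rest=$1 out= inq=0
    while [ -n "$rest" ]
    do
        ch=${rest%"${rest#?}"}      # first character of what is left
        rest=${rest#?}              # drop that character
        case $ch in
            '"') inq=$(( 1 - inq )) ;;          # toggle "inside quotes"
            ',') if [ "$inq" -eq 0 ]
                 then out="$out|"               # a real field separator
                 else out="$out,"               # a comma inside quotes
                 fi ;;
            *)   out="$out$ch" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Usage: requote 'a,"b,c",d' prints a|b,c|d
```

Once the line has been rewritten, an ordinary split on | gives the fields, which is why this only works when the data is regular enough to guarantee that | never appears.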
But sometimes none of that is going to work either. I'm working on a project right now where the input data can have up to three fields, but any of the three can be missing and there are no delimiters and no fixed spacing. The only way to determine what we have is to know that field one, if present, is alphabetic, field two is a whole integer, and field three always has a decimal point. So
ABC 982.00
8
15.45
means that I have fields 1 and 3 on line 1, only field 2 on line 2, and only field 3 on line 3. It's actually much worse than this: there are other fields, some of which are always present and some of which are not, and it is quite a challenge to normalize this stuff so that the data can be massaged. The way to handle it is to split on / / and then determine what we got. So it's something like this:
#!/usr/bin/perl
while (<>) {
    # collapse runs of whitespace to a single space
    s/\s+/ /g;
    @x = split / /;
    foreach (@x) {
        # ... determine what we have based on previous field(s) seen and content
    }
}
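For the simplified three-field case described above, the "determine what we got" step can be sketched in shell by testing each token against a pattern. This is an illustration only: normalize is a made-up name, the | separated output is an assumed normalized form, and the sketch looks at content alone, ignoring the "previous field(s) seen" part that the real data needs.

```shell
#!/bin/sh
# normalize LINE: print "alpha|integer|decimal", leaving missing slots empty.
# Assumption: a token with a letter is field one, digits only is field two,
# and digits with a dot is field three.
normalize() {
    alpha= int= dec=
    set -f                          # no filename expansion while splitting
    for tok in $1                   # unquoted on purpose: split on whitespace
    do
        case $tok in
            *[!A-Za-z0-9.]*) echo "unrecognized token: $tok" >&2 ;;
            *[!0-9.]*)       alpha=$tok ;;   # contains at least one letter
            *.*)             dec=$tok ;;     # digits with a decimal point
            *)               int=$tok ;;     # digits only
        esac
    done
    set +f
    printf '%s|%s|%s\n' "$alpha" "$int" "$dec"
}

# Usage: while IFS= read -r line; do normalize "$line"; done < input
```

With the three sample lines, this should print ABC||982.00, then |8|, then ||15.45, which gives the main program a delimited file with every field in a known position.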
More Articles by Tony Lawrence © 2011-07-04 Tony Lawrence
Thu Mar 17 20:34:15 2005: 186 anonymous
With this prime number programme, how do I find the last prime number that did not go into the prime number? I thought it would be something like
printf(" %-8.3F\n", $value++);
but it does not work.
cat prime
#!/usr/bin/perl
print "enter a number> ";
$number = <STDIN>;
chomp( $number );
if ( $number !~ /^\d+$/ )
{
print "invalid input\n";
exit 1;
}
$prime = 1;
for( $value = 2; $value < $number; $value++ )
{
if ( $number % $value == 0 )
{
$prime = 0;
last;  # Perl leaves a loop with "last", not "break"
}
}
if ( $prime == 1 )
{
print "prime number\n";
}
else
{
print "not a prime number\n";
}
exit 0;
colin_richard_weaver@hotmail.com
weaverc1@cardiff.ac.uk
Thu Mar 17 22:01:19 2005: 189 TonyLawrence
I don't think you understand the code. This is NOT looping through prime numbers, so there is no "last prime number".
Handling missing data in inputs Copyright © September 2004 Tony Lawrence