APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

Slightly Scrambled - unsorting a file

I mentioned looking for "short" pages as part of my on-going and long overdue site cleanup. As I explained at that link, I used a script to help me run through those files and made a first pass at the worst offenders with a weeks worth of work.

I then created a new list, again sorted by size and started going through it again. This was, of course, excruciatingly boring because I had looked at most of these at least once. I needed to look at them again because my first pass was deliberately quick and aimed at cleaning up obviously unneeded pages; I needed to go back to make final decisions, combine appropriate pages and so on. However, the tedium was hard to take.

My first thought was to simply grab random pages. That works, but it has some disadvantages - repetition of already seen pages, yes, but also that it was too random - I really need to concentrate on the shorter pages.


Aaargh. What to do? I wanted an "unsort", obviously, but I wanted it to be biased.

Introducing biased randomness

Un-sorting plainly will rely on random numbers. We've talked about bias in randomness before and it's fairly easy to introduce accidental bias, too.

Let's start by looking at how you might "unsort" a file. I'll use Perl.

Here's a typical way to approach the problem. It uses Perl's associative arrays and (somewhat ironically) uses "sort" to produce random results:

#!/usr/bin/perl
while (<>) {
   $lines{rand()} = $_;
}
foreach  (sort keys %lines) {
   print $lines{$_};
}

That's imperfect as written because there is a chance of generating the same pseudorandom number and thereby overwrite an existing line. That's ok because we need to add bias anyway:

#!/usr/bin/perl 
$BIAS=1;
while (<>) { 
   while (1) {
   $r=rand($BIAS);
   last if not $lines{$r};
   }
   $lines{$r} = $_; 
   $BIAS++;
   
} 

foreach  (sort keys %lines) { 
   print $lines{$_}; 
} 

That does tend to cluster lines around their original positions (and still runs some risk of overwrite), but I don't like it because it's not immediately obvious how to control the amount of bias. I also don't like it because of the associative arrays - I might want to do this in some language that doesn't make that so easy. So, first let's try an "unsort" without associative arrays:

#!/usr/bin/perl
@stuff=<>;
srand;
$max=scalar @stuff;
$count=0;
while ($count < $max) {
   $r=int(rand($max));
   next if $done[$r];
   $done[$r]=1;
   print $stuff[$r];
   $count++;
}
 

That's easy enough, but again how do I introduce bias? I don't like it. Let's try it a bit differently:

!/usr/bin/perl
$SCRAMBLE=25;
srand;
for ($x=0;$x < $SCRAMBLE; $x++) {
 $array[$x]=<>;
}
while (1) {
$r=int(rand($SCRAMBLE));
print $array[$r];
$array[$r]=<>;
last if eof;
}
for ($x=0;$x < $SCRAMBLE;$x++) {
 print $array[$x];
}
 

This is very biased. The lower the value assigned to $SCRAMBLE, the more biased it is toward the original order of the input. For example, take a 250 line input produced with this:

for i in {1..250} ; do echo $i ; done 
 

(See Bash 3.00 brace expansion)

Let's try setting $SCRAMBLE and see what we get. To make the results easier to read, I'll show the first and last few lines and space them across the page in groups of 10. So, with $SCRAMBLE set at 25, we might get:

3 8 1 23 12 13 27 24 31 11
17 6 19 36 30 40 26 37 25 4
29 21 38 33 18 7 32 20 50 44

...

239 241 225 227 233 249 238 245 205 142
248 210 242 226 229 197 171 240 222 250
243 216 247 215 218 234 235 246 213 244
 

But with $SCRAMBLE at 125, it's much more random:

111 91 1 59 36 89 97 22 50 20
47 99 31 135 75 112 76 63 12 43
94 80 88 109 18 51 4 148 58 33

...

179 247 230 137 100 245 102 103 160 222
220 107 239 191 227 186 141 113 243 224
116 117 118 119 120 201 210 181 248 125

If we lower it to 5, it is very biased toward the original order:

2 1 5 7 3 8 4 12 13 11
15 14 10 18 17 20 6 21 22 24
25 16 19 23 26 29 28 32 27 31

...

218 225 227 214 228 230 224 231 226 232
233 234 217 237 238 236 240 242 241 239
244 235 229 243 246 250 247 249 248 245
 

There's something else interesting about this code: unlike the other examples, it doesn't use any memory to store the file to be scrambled - it's RSS (Resident Set Size) will be constant no matter how much data you give it.

But there is a problem: if we happened to set $SCRAMBLED to 250, it doesn't unsort at all - only the very first line is out of order:

61 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29

...

221 222 223 224 225 226 227 228 229 230
231 232 233 234 235 236 237 238 239 240
241 242 243 244 245 246 247 248 249 250
 

That's easy enough to fix:

#!/usr/bin/perl
$SCRAMBLE=250;
srand;
for ($x=0;$x < $SCRAMBLE; $x++) {
 $array[$x]=<>;
}
shuffle(\@array);
while (1) {
$r=int(rand($SCRAMBLE));
print $array[$r];
$array[$r]=<>;
last if eof;
}
shuffle(\@array);
for ($x=0;$x < $SCRAMBLE;$x++) {
 print $array[$x];
}

sub shuffle {
my $array=shift;
my $i;
for ($i = @$array; --$i; ) {
  my $j= int rand ($i+1);
  next if $i == $j;
  @$array[$i,$j]= @$array[$j,$i];
}
}

Though we have lost all bias on short inputs, of course and both of these have another bug: what happens if our input file has less than $SCRAMBLE lines in it? We need to modify our script slightly:


#!/usr/bin/perl
$SCRAMBLE=250;
srand;
for ($x=0;$x < $SCRAMBLE; $x++) {
 $array[$x]=<>;
 last if eof;
}
shuffle(\@array);
while (1) {
$r=int(rand($SCRAMBLE));
print $array[$r];
$array[$r]=<>;
last if eof;
}
shuffle(\@array);
for ($x=0;$x < $SCRAMBLE;$x++) {
 # don't need to do this
 #print $array[$x] if $array[$x];
 print $array[$x] ;
}

sub shuffle {
my $array=shift;
my $i;
for ($i = @$array; --$i; ) {
  my $j= int rand ($i+1);
  next if $i == $j;
  @$array[$i,$j]= @$array[$j,$i];
}
}

This is what I want. It's just a matter of choosing an appropriate number for $SCRAMBLE that gives me the desired bias. A possible change might be to read in everything, see how big it is, and then set $SCRAMBLE to some fraction of that. The code gets more complicated:


#!/usr/bin/perl
@file=<>;
$lines=scalar @file;
$DIVISOR=3;
$SCRAMBLE=int($lines / $DIVISOR);
if ($SCRAMBLE < 2) {
 shuffle(\@file);
 foreach(@file) {
  print;
 }
 exit 0;
}

srand;
$lineount=0;
for ($x=0;$x < $SCRAMBLE; $x++) {
 $array[$x]=$file[$linecount++];
 last if $linecount == $lines;
}
shuffle(\@array);
while (1) {
$r=int(rand($SCRAMBLE));
print $array[$r];
$array[$r]=$file[$linecount++];
last if $linecount == $lines;
}
shuffle(\@array);
for ($x=0;$x < $SCRAMBLE;$x++) {
 #print $array[$x] if $array[$x];
 print $array[$x] ;
}

sub shuffle {
my $array=shift;
my $i;
for ($i = @$array; --$i; ) {
  my $j= int rand ($i+1);
  next if $i == $j;
  @$array[$i,$j]= @$array[$j,$i];
}
}

The larger $DIVISOR, the more bias - unless the file is smaller than $DIVISOR.

Testing the code

It's always easy to screw up code, and this is no exception. There are simple checks, though:

for i in {1..20} ; do echo $i ; done | scramble | sort -un | wc -l
 

That should always produce the same number as you used in the brace expansion - "20" in the example given. This checks for both missing lines and any duplicated lines.

A C++ Unsort

If all you need is an unsort, use one of the Perl examples or download this C++ code. It comples easily on Linux (no makefile, just "gcc -std=c99 -O2 -Wall unsort-1.0.c -o unsort"), though unfortunately not on Mac OS X due to missing "getline".

While of limited use, this code is interesting and was fun to play with. Now it's time to get back to the site cleanup!

Also see rl: randomize lines by Arthur de Jong.


Got something to add? Send me email.





(OLDER)    <- More Stuff -> (NEWER)    (NEWEST)   

Printer Friendly Version

-> -> Unsorting a file with deliberate bias


2 comments



Increase ad revenue 50-250% with Ezoic


More Articles by

Find me on Google+

© Anthony Lawrence







Mon Apr 4 18:55:49 2011: 9423   NickBarron

gravatar


Any excuse to distract you from the cleanup.... ;)





Tue Apr 5 12:09:24 2011: 9427   TonyLawrence

gravatar


Well, yes :)

It gets so boring...

------------------------
Kerio Samepage


Have you tried Searching this site?

Unix/Linux/Mac OS X support by phone, email or on-site: Support Rates

This is a Unix/Linux resource website. It contains technical articles about Unix, Linux and general computing related subjects, opinion, news, help files, how-to's, tutorials and more.

Contact us





What do such machines really do? They increase the number of things we can do without thinking. Things we do without thinking — there's the real danger. (Frank Herbert)

Computers have been taught to distrust each other and will reject attempted connections most of the time. Nowadays, most computers and firewalls are utterly rude about it: it would be like asking someone to dance and having them ignore you as though you were invisible and inaudible. (Tony Lawrence)












This post tagged: