I mentioned looking for "short" pages as part of my ongoing and long overdue site cleanup. As I explained at that link, I used a script to help me run through those files and made a first pass at the worst offenders with a week's worth of work.
I then created a new list, again sorted by size, and started going through it. This was, of course, excruciatingly boring because I had already looked at most of these at least once. I needed to look at them again because my first pass was deliberately quick and aimed at cleaning up obviously unneeded pages; I needed to go back to make final decisions, combine appropriate pages and so on. However, the tedium was hard to take.
My first thought was to simply grab random pages. That works, but it has some disadvantages - repetition of already seen pages, yes, but also that it was too random - I really needed to concentrate on the shorter pages.
Aaargh. What to do? I wanted an "unsort", obviously, but I wanted it to be biased.
Un-sorting plainly will rely on random numbers. We've talked about bias in randomness before and it's fairly easy to introduce accidental bias, too.
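As a quick aside, here's a classic example of accidental bias (my illustration, not one of the unsort scripts below). Swapping every element with a completely random position looks fair, but it generates n**n equally likely swap sequences mapped onto only n! orderings, and since n**n is not a multiple of n!, some orderings come up more often than others. The Fisher-Yates shuffle that turns up later in this article avoids the problem by only ever swapping into the still-unshuffled portion:

#!/usr/bin/perl
# Illustration only - not part of the unsort scripts.

# Naive shuffle: biased, because n**n swap sequences can't map
# evenly onto n! permutations.
sub naive_shuffle {
  my $array = shift;
  for my $i (0 .. $#$array) {
    my $j = int rand(scalar @$array);   # wrong: full range every time
    @$array[$i, $j] = @$array[$j, $i];
  }
}

# Fisher-Yates: each swap stays within the unshuffled portion - uniform.
sub fisher_yates {
  my $array = shift;
  for (my $i = @$array; --$i; ) {
    my $j = int rand($i + 1);
    @$array[$i, $j] = @$array[$j, $i];
  }
}

Shuffle a three element list a few thousand times with each and count the orderings; the naive version visibly favors some of them.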
Let's start by looking at how you might "unsort" a file. I'll use Perl.
Here's a typical way to approach the problem. It uses Perl's associative arrays and (somewhat ironically) uses "sort" to produce random results:
#!/usr/bin/perl
while (<>) {
  $lines{rand()} = $_;
}
foreach (sort keys %lines) {
  print $lines{$_};
}
That's imperfect as written because there is a chance of generating the same pseudorandom number and thereby overwriting an existing line. That's OK, because we need to add bias anyway:
#!/usr/bin/perl
$BIAS = 1;
while (<>) {
  while (1) {
    $r = rand($BIAS);
    last if not $lines{$r};
  }
  $lines{$r} = $_;
  $BIAS++;
}
# sort numerically: the keys now run well past 1, and a plain string sort
# would put "10.x" ahead of "2.x" and spoil the positional clustering
foreach (sort { $a <=> $b } keys %lines) {
  print $lines{$_};
}
That does tend to cluster lines around their original positions (and still runs some risk of overwrite), but I don't like it because it's not immediately obvious how to control the amount of bias. I also don't like it because of the associative arrays - I might want to do this in some language that doesn't make that so easy. So, first let's try an "unsort" without associative arrays:
#!/usr/bin/perl
@stuff = <>;
srand;
$max = scalar @stuff;
$count = 0;
while ($count < $max) {
  $r = int(rand($max));
  next if $done[$r];
  $done[$r] = 1;
  print $stuff[$r];
  $count++;
}
That's easy enough, but again how do I introduce bias? I don't like it. Let's try it a bit differently:
#!/usr/bin/perl
$SCRAMBLE = 25;
srand;
for ($x = 0; $x < $SCRAMBLE; $x++) {
  $array[$x] = <>;
}
while (1) {
  $r = int(rand($SCRAMBLE));
  print $array[$r];
  $array[$r] = <>;
  last if eof;
}
for ($x = 0; $x < $SCRAMBLE; $x++) {
  print $array[$x];
}
This is very biased. The lower the value assigned to $SCRAMBLE, the more biased it is toward the original order of the input. For example, take a 250 line input produced with this:
for i in {1..250} ; do echo $i ; done
(See Bash 3.00 brace expansion)
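To feed that through the script (assuming I've saved it as "scramble", which is just my name for it), I pipe it in:

for i in {1..250} ; do echo $i ; done | perl scramble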
Let's try setting $SCRAMBLE and see what we get. To make the results easier to read, I'll show the first and last few lines and space them across the page in groups of 10. So, with $SCRAMBLE set at 25, we might get:
3 8 1 23 12 13 27 24 31 11
17 6 19 36 30 40 26 37 25 4
29 21 38 33 18 7 32 20 50 44
...
239 241 225 227 233 249 238 245 205 142
248 210 242 226 229 197 171 240 222 250
243 216 247 215 218 234 235 246 213 244
But with $SCRAMBLE at 125, it's much more random:
111 91 1 59 36 89 97 22 50 20
47 99 31 135 75 112 76 63 12 43
94 80 88 109 18 51 4 148 58 33
...
179 247 230 137 100 245 102 103 160 222
220 107 239 191 227 186 141 113 243 224
116 117 118 119 120 201 210 181 248 125
If we lower it to 5, it is very biased toward the original order:
2 1 5 7 3 8 4 12 13 11
15 14 10 18 17 20 6 21 22 24
25 16 19 23 26 29 28 32 27 31
...
218 225 227 214 228 230 224 231 226 232
233 234 217 237 238 236 240 242 241 239
244 235 229 243 246 250 247 249 248 245
There's something else interesting about this code: unlike the other examples, it never holds more than $SCRAMBLE lines in memory, so its RSS (Resident Set Size) will stay essentially constant no matter how much data you give it.
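If you want to see that for yourself on Linux, and assuming /usr/bin/time is GNU time (the bash built-in won't report memory), something like this should show "Maximum resident set size" staying flat no matter how many lines you generate:

for i in {1..500000} ; do echo $i ; done | /usr/bin/time -v perl scramble > /dev/null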
But there is a problem: if we happen to set $SCRAMBLE to 250, the size of our test file, it doesn't unsort at all. The opening loop swallows the whole file, so the while loop prints a single random line and then hits eof immediately, and the closing loop dumps the array in its original order. Only the very first line ends up out of place:
61 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
...
221 222 223 224 225 226 227 228 229 230
231 232 233 234 235 236 237 238 239 240
241 242 243 244 245 246 247 248 249 250
That's easy enough to fix:
#!/usr/bin/perl
$SCRAMBLE = 250;
srand;
for ($x = 0; $x < $SCRAMBLE; $x++) {
  $array[$x] = <>;
}
shuffle(\@array);
while (1) {
  $r = int(rand($SCRAMBLE));
  print $array[$r];
  $array[$r] = <>;
  last if eof;
}
shuffle(\@array);
for ($x = 0; $x < $SCRAMBLE; $x++) {
  print $array[$x];
}

sub shuffle {
  my $array = shift;
  my $i;
  for ($i = @$array; --$i; ) {
    my $j = int rand($i + 1);
    next if $i == $j;
    @$array[$i, $j] = @$array[$j, $i];
  }
}
Of course, we have now lost all bias on short inputs, and both of these scripts have another bug: what happens if our input file has fewer than $SCRAMBLE lines in it? We need to modify our script slightly:
#!/usr/bin/perl
$SCRAMBLE = 250;
srand;
for ($x = 0; $x < $SCRAMBLE; $x++) {
  $array[$x] = <>;
  last if eof;
}
shuffle(\@array);
while (1) {
  $r = int(rand($SCRAMBLE));
  print $array[$r];
  $array[$r] = <>;
  last if eof;
}
shuffle(\@array);
for ($x = 0; $x < $SCRAMBLE; $x++) {
  # don't need to do this
  #print $array[$x] if $array[$x];
  print $array[$x];
}

sub shuffle {
  my $array = shift;
  my $i;
  for ($i = @$array; --$i; ) {
    my $j = int rand($i + 1);
    next if $i == $j;
    @$array[$i, $j] = @$array[$j, $i];
  }
}
This is what I want. It's just a matter of choosing an appropriate number for $SCRAMBLE that gives me the desired bias. A possible change might be to read in everything, see how big it is, and then set $SCRAMBLE to some fraction of that. The code gets more complicated:
#!/usr/bin/perl
@file = <>;
$lines = scalar @file;
$DIVISOR = 3;
$SCRAMBLE = int($lines / $DIVISOR);
if ($SCRAMBLE < 2) {
  # too short to bother with bias - just shuffle the whole thing
  shuffle(\@file);
  foreach (@file) {
    print;
  }
  exit 0;
}
srand;
$linecount = 0;
for ($x = 0; $x < $SCRAMBLE; $x++) {
  $array[$x] = $file[$linecount++];
  last if $linecount == $lines;
}
shuffle(\@array);
while (1) {
  $r = int(rand($SCRAMBLE));
  print $array[$r];
  $array[$r] = $file[$linecount++];
  last if $linecount == $lines;
}
shuffle(\@array);
for ($x = 0; $x < $SCRAMBLE; $x++) {
  #print $array[$x] if $array[$x];
  print $array[$x];
}

sub shuffle {
  my $array = shift;
  my $i;
  for ($i = @$array; --$i; ) {
    my $j = int rand($i + 1);
    next if $i == $j;
    @$array[$i, $j] = @$array[$j, $i];
  }
}
The larger $DIVISOR is, the more bias - unless the file is so short that $SCRAMBLE drops below 2, in which case it just gets shuffled completely.
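If I wanted to tune the bias per run, one small change (my variation, not part of the original script) would be to take $DIVISOR from the command line: drop the hard-coded $DIVISOR = 3 and put this above the @file = <> line, so the divisor argument isn't treated as an input file:

# my variation: take the divisor from the command line, defaulting to 3
# usage: unsort.pl divisor [file ...]   e.g.  perl unsort.pl 5 shortpages.txt
$DIVISOR = shift @ARGV || 3;

The "unsort.pl" and "shortpages.txt" names there are just placeholders, of course.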
It's always easy to screw up code, and this is no exception. There are simple checks, though:
for i in {1..20} ; do echo $i ; done | scramble | sort -un | wc -l
That should always produce the same number you used in the brace expansion - "20" in the example given - so any lines the script dropped show up immediately as a short count.
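One thing that check can't catch on its own: "sort -un" collapses duplicates, so a line emitted twice would still count as 20 unique values. Comparing the raw line count as well covers that case (again assuming the script is installed as "scramble"):

for i in {1..20} ; do echo $i ; done | scramble | wc -l

Both counts should come out to 20.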
If all you need is an unsort, use one of the Perl examples or download this C code. It compiles easily on Linux (no makefile, just "gcc -std=c99 -O2 -Wall unsort-1.0.c -o unsort"), though unfortunately not on Mac OS X, because of its missing "getline".
While of limited use, this code is interesting and was fun to play with. Now it's time to get back to the site cleanup!
Also see rl: randomize lines by Arthur de Jong.
Mon Apr 4 18:55:49 2011: 9423 NickBarron
Any excuse to distract you from the cleanup.... ;)
Tue Apr 5 12:09:24 2011: 9427 TonyLawrence
Well, yes :)
It gets so boring...
Slightly Scrambled - unsorting a file Copyright © April 2011 Tony Lawrence