"Amigos y nadie más. El resto, la selva"
-- Jorge Guillén

Nifty parsing trick level 101

From the tokenizing-input-sucks! dept. by Luis

Say you have a file with lines like the following that need to be parsed into a data structure:

my_line has="values" and="also values" finally="values,"
my_2nd_line has="other values" and="also values with more" finally="values,and,valuables"

And you need a data structure like:

$obj->{my_line}->[{has=>"values",and=>"also values",finally=>"values,"},...]

In other words, you need a hash whose values are arrays, and whose array elements are hashes themselves. That makes it easy to navigate the data and print anything you like. (This is very common when parsing formatted text such as XML files.)
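To make the target shape concrete, here is a minimal sketch that builds one record by hand and walks it. The name $obj comes from the line above; describe_records is a made-up helper used only for illustration.

```perl
use strict;
use warnings;

# Hand-built instance of the structure described above:
# record name => array of hashes holding the key/value pairs.
my $obj = {
    'my_line' => [
        { 'has' => "values", 'and' => "also values", 'finally' => "values," },
    ],
};

# describe_records is a hypothetical helper: it returns one
# "record.key=value" string per pair, with keys sorted so the
# order is stable from run to run.
sub describe_records {
    my ($data) = @_;
    my @out;
    foreach my $name (sort keys %{$data}) {
        foreach my $pairs (@{ $data->{$name} }) {
            foreach my $key (sort keys %{$pairs}) {
                push @out, "$name.$key=$pairs->{$key}";
            }
        }
    }
    return @out;
}

print "$_\n" for describe_records($obj);

# A single value can also be reached directly.
print $obj->{'my_line'}->[0]->{'and'}, "\n";
```

Once the data is in this form, reaching any value is a chain of hash and array lookups rather than more string parsing.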

So how do you actually do this? One approach that works well is to read the file line by line: newline characters already separate the records, so each line can be read and then parsed on its own. Each line then has to be split into "tokens". The catch is that the values can contain characters you cannot anticipate, such as spaces and commas. So which character can you split on that will not also show up inside a value?

The trick is to rewrite each line as you read it, inserting whatever character sequence you would like to be your "token" separator. So you can do things like:

foreach line in lines do
    if line contains a " followed by a space
        substitute "space with "tab+tab
    array tokens = split line by tab+tab
    foreach token in tokens do
        assign to key, val the result of splitting token by equal sign (=)
        hash h{key} = val

And that will yield your desired data structure. Of course, there are many other ways to parse these lines, such as treating each string as an array of characters, assembling tokens one character at a time and comparing them with known tokens. That can be very error prone, and it takes time to write and test. Although the method shown here is not perfect, it is easy to implement if you already have a general idea of what the input file will look like (say, you are the only person creating these files). In short, if you cannot rely on a library, this technique can get you moving.

Here is some sample Perl code that does this:

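use strict;
use warnings;
use Data::Dumper;

my %database = ();
open(my $fh, '<', 'sample_file.txt') or die "Cannot open sample_file.txt: $!";
while (my $_l = <$fh>) {
    chomp $_l;
    my $obj;
    # Only lines that start with a known record name are parsed.
    if ($_l =~ /^\s*(my_line|my_2nd_line)/i) {
        $database{$1} = [];
        $obj = $database{$1};
        # Strip the record name so only the key="value" pairs remain.
        $_l =~ s/^\s*(my_line|my_2nd_line)//i;
    } else {
        print STDERR "skipping line\n";
        next;
    }
    # Mark the end of each quoted value with two tabs, the token separator.
    $_l =~ s/"\s+/"\t\t/g;

    my @tokens = split(/\t\t/, $_l);
    foreach my $token (@tokens) {
        # A limit of 2 keeps any '=' inside a value intact.
        my ($key, $val) = split(/=/, $token, 2);
        next unless defined $val;
        # Trim surrounding whitespace and drop the quotes.
        $key =~ s/^\s*|\s*$|"//g;
        $val =~ s/^\s*|\s*$|"//g;
        my %hash = ( $key => $val );
        push (@{$obj}, \%hash);
    }
}
close($fh);

print Dumper \%database;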

The output looks like this:

$VAR1 = {
          'my_line' => [
                         {
                           'has' => 'values'
                         },
                         {
                           'and' => 'also values'
                         },
                         {
                           'finally' => 'values,'
                         }
                       ],
          'my_2nd_line' => [
                             {
                               'has' => 'other values'
                             },
                             {
                               'and' => 'also values with more'
                             },
                             {
                               'finally' => 'values,and,valuables'
                             }
                           ]
        };

*Update 2011-09-07 15:00 UTC: * Of course you could change your regular expression (regex) to just split by something unique between your tokens like:
my @tokens = split(/\s*"\s+/, $_l);
However, this was just an exercise to show alternatives.
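
As a rough sketch of that alternative, the split from the update can be wrapped in a small function. The name parse_line is made up for illustration, and the sketch assumes the record name is always the first whitespace-delimited word on the line. Unlike the sample script, it gathers all pairs from a line into a single hash.

```perl
use strict;
use warnings;

# parse_line is a hypothetical helper, not part of the original script.
# It turns a line such as
#   my_line has="values" and="also values"
# into a record name and a reference to a hash of its key/value pairs.
sub parse_line {
    my ($line) = @_;

    # Assumption: the record name is the first word on the line.
    return unless $line =~ s/^\s*(\w+)\s+//;
    my $name = $1;

    my %pairs;
    # Split on a closing quote followed by whitespace, as in the update above.
    foreach my $token (split /\s*"\s+/, $line) {
        # A limit of 2 keeps any '=' inside a value intact.
        my ($key, $val) = split /=/, $token, 2;
        next unless defined $val;
        # Trim surrounding whitespace and drop the remaining quotes.
        s/^\s+|\s+$|"//g for $key, $val;
        $pairs{$key} = $val;
    }
    return ($name, \%pairs);
}

my ($name, $pairs) = parse_line(
    'my_2nd_line has="other values" and="also values with more" finally="values,and,valuables"'
);
print "$name: $_ => $pairs->{$_}\n" for sort keys %{$pairs};
```

Because the split consumes the closing quote of every token except the last, the cleanup step still has to strip the stray quotes, just as the sample script does.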

