Gentoo Forums
How to split this ASCII file into chunks? Programmer needed
urcindalo
l33t


Joined: 08 Feb 2005
Posts: 623
Location: Almeria, Spain

Posted: Thu Jan 17, 2013 10:03 am    Post subject: How to split this ASCII file into chunks? Programmer needed

Hi and thanks for helping me out.

I need a very simple script, preferably in bash, so that I can split an ASCII file into chunks.
The file itself shows the following structure:
Code:
ZINC02384989
  -OEChem-01171301283D

 42 43  0     1  0  0  0  0  0999 V2000
    9.1303    3.5000    3.8395 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.2989    3.9671    3.2534 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.3583    4.2061    1.8981 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.2372    3.9779    1.1064 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.0556    3.5046    1.7052 C   0  0  0  0  0  0  0  0  0  0  0  0
... -> many lines here like the ones above and below

 23 40  1  0  0  0  0
 37 41  1  0  0  0  0
 38 41  1  0  0  0  0
 41 42  1  0  0  0  0
M  CHG  3  39  -1  40  -1  41   1
M  END
$$$$
ZINC04899456
  -OEChem-01171301283D

 65 66  0     1  0  0  0  0  0999 V2000
    4.5113    4.3431    3.0084 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5217    2.9765    3.6963 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.3676    3.0525    4.9689 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.0899    2.5770    4.0587 C   0  0  0  0  0  0  0  0  0  0  0  0
....
....
$$$$


I want every chunk of text between the $$$$ marks to be written to its own text file in the same directory, named after the first line of the chunk: ZINC02384989 for the first file, ZINC04899456 for the second, and so on. The file extension should be .sdf.
In other words, the first file, ZINC02384989.sdf, will begin with ZINC02384989 and end with $$$$; the second will begin with ZINC04899456 and end with another $$$$; and so on.

The original file to be split ends with a final $$$$ mark but does not begin with one, as shown above. Its name also has the .sdf extension, but both the name and the extension can be changed freely if the script needs it.

In the worst case there could be literally thousands of these chunks, so generating the individual files by hand is out of the question.

Thanks very much in advance.
urcindalo
l33t


Joined: 08 Feb 2005
Posts: 623
Location: Almeria, Spain

Posted: Thu Jan 17, 2013 10:21 am

I forgot to mention that the same entry name may sometimes appear more than once, since the entries can correspond to different conformations of the same molecule.

So the script needs some kind of check so that it does not overwrite earlier files, adding something like "-2", "-3"... to the file names in these cases.
tomk
Bodhisattva


Joined: 23 Sep 2003
Posts: 7221
Location: Sat in front of my computer

Posted: Thu Jan 17, 2013 11:38 am

I had something similar written in Perl, which I've modified to suit your needs. Save it as split.pl, then run: perl split.pl filename

split.pl:
#!/usr/bin/perl

use warnings;
use strict;

$| = 1;    # unbuffered STDOUT so progress messages appear immediately

if (@ARGV != 1) {
    print STDERR "usage: $0 filename\n";
    exit(1);
}
my $infile = $ARGV[0];

open(my $in, '<', $infile) || die("couldn't open $infile: $!");

my $out;        # handle of the chunk currently being written
my $outfile;    # its file name; undefined between chunks

while (my $line = <$in>) {
    if ($line =~ m/^\$\$\$\$\s*$/) {
        # a line consisting solely of $$$$ closes the current chunk
        if (defined $outfile) {
            print $out $line;
            print "writing to $outfile\n";
            close($out);
            undef $outfile;
        }
        next;
    }

    if (! defined $outfile) {
        # the first line of a chunk is the entry name
        (my $name = $line) =~ s/\s+$//;
        next if $name eq '';    # ignore blank lines between chunks

        # never overwrite: fall back to NAME-1.sdf, NAME-2.sdf, ...
        $outfile = "$name.sdf";
        for (my $n = 1; -e $outfile; $n++) {
            $outfile = "$name-$n.sdf";
        }
        open($out, '>', $outfile) || die("couldn't open $outfile: $!");
    }

    print $out $line;
}

# the input should end with $$$$, but close a dangling chunk just in case
close($out) if (defined $outfile);
close($in);
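
Since the original request was for bash, the same approach can be sketched roughly in pure bash as well. This is only an illustration under a few assumptions: the function names split_sdf and unique_name are made up for this example, entry names are assumed to be safe to use as file names (no slashes), and duplicates are numbered the same way as in the Perl script (NAME-1.sdf, NAME-2.sdf, ...).

```shell
#!/usr/bin/env bash
# Sketch only: split_sdf and unique_name are illustrative names.

# Print NAME.sdf if that file does not exist yet, otherwise the first
# free name among NAME-1.sdf, NAME-2.sdf, ...
unique_name() {
    local name=$1 candidate="$1.sdf" n=1
    while [[ -e $candidate ]]; do
        candidate="$name-$n.sdf"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Split the file given as $1 into one .sdf file per $$$$-terminated
# record, written to the current directory.
split_sdf() {
    local infile=$1 line out=""
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ -z $out ]]; then
            # No record is open: ignore stray delimiters and blank lines.
            if [[ $line == '$$$$' || -z ${line//[[:space:]]/} ]]; then
                continue
            fi
            # The first line of a record is the entry name.
            out=$(unique_name "$line")
            : > "$out"
            echo "writing to $out"
        fi
        printf '%s\n' "$line" >> "$out"
        # The delimiter line closes the current record.
        if [[ $line == '$$$$' ]]; then
            out=""
        fi
    done < "$infile"
    return 0
}

# Run only when exactly one file name is given on the command line.
if [[ $# -eq 1 ]]; then
    split_sdf "$1"
fi
```

Saved as, say, split.sh, it would be run as: bash split.sh filename. Because it reopens the output file for every line, it is likely to be noticeably slower than the Perl script on files with thousands of records.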



Last edited by tomk on Thu Jan 17, 2013 3:55 pm; edited 1 time in total
urcindalo
l33t


Joined: 08 Feb 2005
Posts: 623
Location: Almeria, Spain

Posted: Thu Jan 17, 2013 11:56 am

I can hardly express my gratitude. The script works like a charm. Thanks very much indeed.
This is the reason why Gentoo has the best Linux user community out there :)
All times are GMT