2015/11/27

Parsing a smushed string

If you had a string of words smushed together without spaces, how would you go about parsing the string into words again?

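One approach, from the blog post and gist linked here, does it with a single Perl regular expression:

http://blogs.perl.org/users/ingy_dot_net/2015/11/perl-regular-expression-awesomeness.html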

https://gist.github.com/ingydotnet/94528c938ca94f684270

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Printer;

my $input = 'minusthemessageforeverytriedword';

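# All 3+ letter English words, longest to shortest.
# `chomp; $_` returns only the trimmed word; `chomp, $_` would also
# return chomp's count (1) as an extra list element for every line.
my @long = grep {length > 2}
    sort {length $b <=> length $a}
    map {chomp; $_}
    `cat /usr/share/dict/words`;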

# Prefer `for` over `fore` and `the` over `them`: alternation tries
# words left to right, so put these two at the front of the list.
unshift @long, qw( the for );

# Too many small words in dict file. Use these:
my @short = qw(
    ad ah am an as at ax be by do go he hi if in is it
    me my no of oh on or ox pi so to up us we yo a I
);
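# Make a gigantic list of words for the regexp. quotemeta guards against
# dictionary entries that contain regexp metacharacters:
my $list = join '|', map { quotemeta($_) } @long, @short;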

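# \G anchors each match where the previous one ended. The lookahead
# accepts a word only if the rest of the string can still be split
# entirely into listed words, all the way to the end (\z).
my @words = $input =~ /\G($list)(?=(?:$list)*\z)/g;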

p @words;
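
To try the regexp without a system dictionary, here is a minimal sketch of the same `\G` plus lookahead technique with a tiny hand-picked word list. The `unsmush` helper and the contents of `@dict` are assumptions made up for illustration; they are not part of the original gist.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical mini-dictionary (illustration only). `forever` and `them`
# are included on purpose and listed before `for` and `the`.
my @dict = qw( message forever every tried minus word them the for me );

# Split $text into words from @words. Returns an empty list if the
# string cannot be covered completely.
sub unsmush {
    my ($text, @words) = @_;
    my $alt = join '|', map { quotemeta($_) } @words;

    # \G: start each match where the previous one ended.
    # (?=(?:$alt)*\z): keep a word only if the remainder can also be
    # split into listed words up to the end of the string.
    return $text =~ /\G($alt)(?=(?:$alt)*\z)/g;
}

my @split = unsmush('minusthemessageforeverytriedword', @dict);
print "@split\n";
```

With this list, `them` and `forever` are tried before `the` and `for`, but the lookahead rejects them because `essage...` and `ytriedword` cannot be split into listed words, so the engine backtracks to the shorter alternatives. An input with no complete split, such as `minusxyz`, yields an empty list.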
