Regular expressions: find paragraphs not containing??

elmimmo
09-14-2005, 10:23 AM
Hi,

I am playing with the grep search & replace in TextWrangler (http://www.barebones.com/products/textwrangler/index.shtml), and once again I have run into a search I never know how to write: "find a string that does not contain this other string".

Since that probably does not make sense on its own, here is an example. I have a text file with gazillions of vCards exported from Address Book, and I am trying to find all the ones that do NOT have an EMAIL set.

The text file looks like this:

BEGIN:VCARD
VERSION:3.0
N:;John;;;
FN:John
TEL;type=pref:555 55 55 55
END:VCARD

BEGIN:VCARD
VERSION:3.0
N:Gordon;Ann;;;
FN:Ann Gordon
EMAIL;type=INTERNET;type=WORK;type=pref:user@host.com
TEL;type=CELL;type=pref:987 654 321
END:VCARD

BEGIN:VCARD
VERSION:3.0
N:Smith;Mike;;;
FN:Mike Smith
TEL;type=HOME;type=pref:6666 66 66
TEL;type=CELL:124 456 678
END:VCARD

How would you write a regexp that matches every full vCard that does NOT have an email (i.e. the first and third vCards)?

PS: TextWrangler's help file states that its regexp engine is "PCRE-based grep engine". Besides, it "supports several extended sequences, which provide grep patterns with super-powers from another universe. Their syntax is in the form (?KEY...)." If you need me to copy paste from the help file what these extensions are, just ask.

elmimmo
09-14-2005, 11:32 AM
Ok, I think an answer to this easier question would do it: How do I search for "anything that is not this string" (instead of anything that is not this character, which I know how to do with "^").

nkuvu
09-14-2005, 12:26 PM
Well, if it's really PCRE (Perl Compatible Regular Expressions, if you didn't know), you could use a negative lookahead assertion:
^(?!.*(?:email))
Note that ^ only negates inside a character class; by itself it marks the beginning of a line. The pattern also needs case-insensitive matching to catch the uppercase EMAIL in your file.

Of course by itself this regex matches all of the lines except the email address line. So you need some way to associate not just lines, but cards.

The regex in question I tested using the following Perl script which does things that TextWrangler may or may not do:
#!/usr/bin/perl
use strict;
use warnings;

my $string = <<'MONKEYS';
BEGIN:VCARD
VERSION:3.0
N:;John;;;
FN:John
TEL;type=pref:555 55 55 55
END:VCARD

BEGIN:VCARD
VERSION:3.0
N:Gordon;Ann;;;
FN:Ann Gordon
EMAIL;type=INTERNET;type=WORK;type=pref:user@host.com
TEL;type=CELL;type=pref:987 654 321
END:VCARD

BEGIN:VCARD
VERSION:3.0
N:Smith;Mike;;;
FN:Mike Smith
TEL;type=HOME;type=pref:6666 66 66
TEL;type=CELL:124 456 678
END:VCARD
MONKEYS

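# Cards are separated by a blank line.
my @cards = split "\n\n", $string;

foreach my $card (@cards) {
    # /s lets . cross line breaks, so the lookahead scans the whole card;
    # /i makes "email" match the uppercase EMAIL property.
    if ($card =~ /^(?!.*(?:email))/si) {
        print "card has no email address:\n$card\n\n";
    }
    else {
        print "card has email address:\n$card\n\n";
    }
}

The output from said script is: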
card has no email address:
BEGIN:VCARD
VERSION:3.0
N:;John;;;
FN:John
TEL;type=pref:555 55 55 55
END:VCARD

card has email address:
BEGIN:VCARD
VERSION:3.0
N:Gordon;Ann;;;
FN:Ann Gordon
EMAIL;type=INTERNET;type=WORK;type=pref:user@host.com
TEL;type=CELL;type=pref:987 654 321
END:VCARD

card has no email address:
BEGIN:VCARD
VERSION:3.0
N:Smith;Mike;;;
FN:Mike Smith
TEL;type=HOME;type=pref:6666 66 66
TEL;type=CELL:124 456 678
END:VCARD

jecwobble
09-14-2005, 12:39 PM
I can't test this out since I'm away from my Mac, but if I were trying to do this, I would make a copy of the file with the vCard info in it and open that with TextWrangler.

I would do a find-and-replace on the copy so that each vCard entry is on one line. If the cards are separated by two returns, I would first replace "\r" with something like "~" (any character that does not occur in the file). That gives you two tildes between each card and one tilde between each line of each card.

Replace "~~" with "\r". Now you have one line per card.

Save the file, let's say it's named cards.txt.

In the Terminal, run grep -v EMAIL cards.txt > cards-no-email.txt

This creates another file called cards-no-email.txt

Open cards-no-email.txt in TextWrangler and replace "\r" with "\r\r".

Replace "~" with "\r". You should now have a file of vCards without email addresses.
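
The same join, filter, and split idea can be sketched entirely in the Terminal. This version swaps the TextWrangler find-and-replace passes for awk, which is a substitution of my own rather than part of the steps above; the temporary file, the trimmed-down sample cards, and the variable names are hypothetical, and it assumes Unix line endings and no "~" anywhere in the data.

```shell
# Hypothetical sample file: shortened versions of the three cards.
cards=$(mktemp)
cat > "$cards" <<'EOF'
BEGIN:VCARD
FN:John
TEL;type=pref:555 55 55 55
END:VCARD

BEGIN:VCARD
FN:Ann Gordon
EMAIL;type=INTERNET;type=WORK;type=pref:user@host.com
END:VCARD

BEGIN:VCARD
FN:Mike Smith
TEL;type=CELL:124 456 678
END:VCARD
EOF

# 1. Join each blank-line-separated card into one line, with "~" between
#    its original lines (RS = "" makes awk read paragraph by paragraph).
# 2. Drop every one-line card that mentions EMAIL.
# 3. Turn each "~" back into a line break and re-add the blank line.
no_email=$(
  awk 'BEGIN { RS = ""; FS = "\n"; OFS = "~" } { $1 = $1; print }' "$cards" |
  grep -v EMAIL |
  awk '{ gsub(/~/, "\n"); print; print "" }'
)

printf '%s\n' "$no_email"
rm -f "$cards"
```

Like the grep command above, this drops any card that contains the text EMAIL anywhere, not only as a property name.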

pmccann
09-14-2005, 07:09 PM
Just for fun here's a nice one-liner to do the trick... if the file is called "filename" just use:

perl -000 -ne 'print unless /^EMAIL/m' filename

Redirect it to another file if you want to save the entries that don't have EMAIL fields...

perl -000 -ne 'print unless /^EMAIL/m' filename > no_email_addr

[[Quick explanation: -000 puts perl into "paragraph mode", so that it grabs a paragraph at a time from filename when looping (via the -n switch). The regular expression just says "find any paragraph that contains a line beginning with EMAIL": the m flag means "multiline match", which switches "^" to mean "beginning of line" instead of its usual "beginning of string". So we print every paragraph that doesn't match this regex.]]

Cheers,
Paul

jecwobble
09-15-2005, 01:15 AM
pmccann - I assume perl would consider a paragraph to be any string of text separated by a return character. If that is the case, each line of the vCard file would be considered a paragraph, right?

nkuvu
09-15-2005, 12:44 PM
One end-of-line character ends a line. Two or more in a row end a paragraph.

From perldoc perlrun:
-0[*octal/hexadecimal*]
specifies the input record separator ($/) as an octal or
hexadecimal number. If there are no digits, the null character is
the separator. Other switches may precede or follow the digits. For
example, if you have a version of find which can print filenames
terminated by the null character, you can say this:

find . -name '*.orig' -print0 | perl -n0e unlink

The special value 00 will cause Perl to slurp files in paragraph
mode. The value 0777 will cause Perl to slurp files whole because
there is no legal byte with that value.

If you want to specify any Unicode character, use the hexadecimal
format: "-0xHHH...", where the "H" are valid hexadecimal digits.
(This means that you cannot use the "-x" with a directory name that
consists of hexadecimal digits.)

I hadn't heard of this switch before, but it's very nifty keen.

And of course, if you try the one-liner that pmccann supplied, it works perfectly.


 
