Perl Wiki
History for Using Regular Expressions to Parse HTML Part 4 (history as of 04/25/2007 18:54:01)

In Part 1 we had a quick look at what Perl and regular expressions
are, and introduced the idea of using them to process HTML files. In
Part 2 we developed a Perl script to process a single HTML file. In
Part 3 we looked at one way of processing multiple files. In this part
we'll look at another way to pick up files for processing.

In Part 3 we wrote a script (script2.pl) that enabled us to enter filenames at the command prompt:

c:>perl script2.pl file1.htm file2.htm file3.htm

Although this script enables us to process as many files as we want to,
the drawback is that every filename has to be typed in manually. This
is fine if you only want to process a few files, but if you've got
hundreds or thousands to process, this approach is not feasible.

Note:
Due to display considerations, in the example code shown in this
article, square brackets '[..]' are used in HTML/script tags instead of
angle brackets '<..>'.
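For reference, here is a minimal sketch of how one of the listing lines looks once the square brackets are turned back into angle brackets. The sample string is made up for illustration. Typed literally, [h1] would be a character class matching a single 'h' or '1', so real scripts need the angle brackets.

```perl
use strict;
use warnings;

# Listing line 6 of script2.pl, written with real angle brackets.
my $line = "<h1>Welcome</h1>\n";      # hypothetical sample input
$line =~ s/<h1>/<h1 class="big">/;    # replace the first <h1> on the line
print $line;
```

The file-reading line follows the same rule: while ($line = [IN]) in the listings stands for a while loop that reads from the IN filehandle using angle brackets.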

script2.pl

1 foreach $file (@ARGV) {
2     rename $file, "$file.bak";
3     open (IN, "<$file.bak");
4     open (OUT, ">$file");
5     while ($line = [IN]) {
6         $line =~ s/[h1]/[h1 class="big"]/;
7         (print OUT $line);
8     }
9     close IN;
10     close OUT;
11 }

In script2.pl, it's line 1 that enables us to enter filenames at the
command prompt. script3.pl, which is listed below, instead processes
every HTML file that has a .htm extension in the current
directory/folder. This is the directory in which the script itself and
all the files to be processed are located.

script3.pl

1 opendir(DIR, ".") or die "can't opendir: $!";
2 @allfiles = grep (/.htm$/i, readdir DIR);
3 closedir(DIR);
4 foreach $file (@allfiles) {
5     rename $file, "$file.bak";
6     open (IN, "<$file.bak");
7     open (OUT, ">$file");
8     while ($line = [IN]) {
9         $line =~ s/[h1]/[h1 class="big"]/;
10         (print OUT $line);
11     }
12     close IN;
13     close OUT;
14 }

The only difference between script2.pl and script3.pl is the first few lines. Let's look at the new lines in script3.pl.

Line 1

This line opens the current directory (signified by a dot ".") for
reading and gives it a directory handle of DIR. If the directory cannot
be opened, die stops the script and displays an error message that
includes the system's reason for the failure ($!).
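As a minimal sketch of the opendir/die pattern described above, the snippet below wraps it in a small helper. The name list_dir is hypothetical, made up for this illustration, and the sketch uses a lexical handle ($dh) instead of the bareword DIR used in script3.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of all entries in a directory.
sub list_dir {
    my ($dir) = @_;
    # opendir returns false on failure; $! then holds the system error text.
    opendir(my $dh, $dir) or die "can't opendir $dir: $!";
    my @entries = readdir $dh;   # list context: every entry name at once
    closedir($dh);
    return @entries;
}

my @entries = list_dir(".");
print scalar(@entries), " entries found in the current directory\n";
```

Calling list_dir on a directory that does not exist ends the script with the "can't opendir" message, which is the behaviour Line 1 relies on.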

Line 2

This line reads in the names of all the .htm files in the directory,
and puts them in an array called @allfiles. In Perl, a '@' indicates an
array, and a '$' indicates a scalar variable. A scalar variable stores
a single value, whereas an array stores a list of values.
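The difference between the two sigils can be sketched as follows. The filenames are invented for the example and are not taken from the article.

```perl
use strict;
use warnings;

my $file     = "index.htm";                                # '$': one value
my @allfiles = ("index.htm", "about.htm", "contact.htm");  # '@': a list of values

my $count = scalar @allfiles;   # an array in scalar context gives its length
my $first = $allfiles[0];       # one element is a single value, hence the '$'

print "$count files, the first is $first\n";
```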

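grep is a built-in Perl function named after the search command from the UNIX world. Here it keeps only those names returned by readdir that match the pattern; the trailing i makes the match case-insensitive, so files ending in .HTM are included too.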

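Note that there should be a backslash character directly before the '.htm' in the pattern, but it isn't being displayed in the listing. With the backslash, the dot matches only a literal dot; without it, the dot matches any single character.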
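To see why the backslash matters, here is a small sketch that runs grep over a made-up list of names with and without it. The names stand in for what readdir might return and are purely illustrative.

```perl
use strict;
use warnings;

# Hypothetical directory listing, standing in for the output of readdir.
my @names = ("index.htm", "ABOUT.HTM", "notes.txt", "xhtm", "page.html");

# Unescaped dot: '.' matches any character, so "xhtm" slips through.
my @loose = grep { /.htm$/i } @names;

# Escaped dot: '\.' matches only a literal dot.
my @exact = grep { /\.htm$/i } @names;

print "loose: @loose\n";
print "exact: @exact\n";
```

With the escaped dot, only index.htm and ABOUT.HTM are kept; the unescaped version also keeps xhtm.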

Line 3

This line closes the DIR directory handle.

Running the script

c:>perl script3.pl
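For anyone who wants to paste something straight into a file, the following is one possible runnable variant of script3.pl, not the article's exact listing. It assumes real angle brackets, the escaped dot, lexical filehandles, three-argument open, and error checks on every file operation. The helper name process_dir and the optional directory argument are additions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: add class="big" to the first <h1> on each line of
# every .htm file in $dir, keeping each original as a .bak copy.
sub process_dir {
    my ($dir) = @_;

    opendir(my $dh, $dir) or die "can't opendir $dir: $!";
    # Keep plain files whose names end in .htm (any letter case).
    my @allfiles = grep { /\.htm$/i && -f "$dir/$_" } readdir $dh;
    closedir($dh);

    foreach my $name (@allfiles) {
        my $file = "$dir/$name";
        rename($file, "$file.bak")      or die "can't rename $file: $!";
        open(my $in,  '<', "$file.bak") or die "can't read $file.bak: $!";
        open(my $out, '>', $file)       or die "can't write $file: $!";
        while (my $line = <$in>) {
            $line =~ s/<h1>/<h1 class="big">/;
            print $out $line;
        }
        close($in);
        close($out) or die "can't close $file: $!";
    }
    return scalar @allfiles;   # number of files processed
}

# Default to the current directory, as script3.pl does.
my $count = process_dir(@ARGV ? $ARGV[0] : ".");
print "processed $count file(s)\n";
```

Run without arguments, it behaves like script3.pl and works on the current directory; passing a directory name as the first argument is an extra convenience of this sketch.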





In Part 5 we'll look at how to read in specific files from specific directories.

About the Author: John Dixon is a web developer and technical author.
These days, John spends most of his time developing dynamic
database-driven websites using PHP and MySQL.


Go to http://www.computernostalgia.net to view one of John's sites. This site contains articles and photos relating to the history of the computer.


To find out more about John's work, go to http://www.dixondevelopment.co.uk.

Article Source: http://EzineArticles.com/?expert=John_Dixon

  


Copyright 2007 by Perl Pages Forum