Saturday, May 17, 2008 ..:: Perl Wiki ::..
Using Regular Expressions to Parse HTML Part 1

Like many web content authors, I have often needed to clean up a batch
of HTML files generated by a word processor or publishing package. At
first, I cleaned up the files manually, opening each one in turn and
making the same set of updates to each one. This works fine when you
only have a few files to fix, but when you have hundreds or even
thousands, you can quickly be looking at weeks or even months of work.
A few years ago someone put me on to the idea of using Perl and
regular expressions to perform this ‘cleaning up’ process.

"Why write an article about Perl and regular expressions?" I hear you
say. Well, that’s a good point. After all, the web is full of tutorials
on Perl and regular expressions. When I was trying to find out how to
process HTML files, though, I found it difficult to find tutorials
that met my criteria. I’m not saying they don’t exist; I just couldn’t
find them. Sure, I could find tutorials that explained everything I
needed to know about regular expressions, and I could find plenty of
tutorials about how to program in Perl, and even how to use regular
expressions within Perl scripts. What I couldn’t find was a tutorial
that explained how to open one or more HTML or text files, make
updates to those files using regular expressions, and then save and
close the files.
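
As a rough illustration of that open, update, save and close workflow, here is a minimal sketch. The helper name clean_html and the single rule it applies (turning <b> tags into <strong> tags) are assumptions made for this example only. They are not taken from the later parts of this series.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies one example substitution to a page's text.
sub clean_html {
    my ($html) = @_;
    # Example rule only: replace <b>...</b> with <strong>...</strong>.
    # g = every occurrence, i = ignore case, s = let . match newlines.
    $html =~ s{<b>(.*?)</b>}{<strong>$1</strong>}gis;
    return $html;
}

# Process every file named on the command line.
for my $file (@ARGV) {
    # Open the file and read its whole content into one string.
    open(my $in, '<', $file) or die "Cannot open $file for reading: $!";
    my $html = do { local $/; <$in> };
    close($in);

    my $cleaned = clean_html($html);

    # Save the updated content back to the same file, then close it.
    open(my $out, '>', $file) or die "Cannot open $file for writing: $!";
    print {$out} $cleaned;
    close($out) or die "Cannot close $file: $!";
}
```

Saved under a name of your choice, say cleanup.pl, the script could be run as perl cleanup.pl page1.html page2.html. Because it overwrites each file in place, it is sensible to work on copies.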


The Goal


When converting documents into HTML, the goal is always to achieve a
seamless conversion from the source document (for example, a word
processor document) to HTML. The last thing you need is for your
content authors to be spending hours, or even days, fixing untidy HTML
code after it has been converted.

Many applications offer excellent tools for converting documents to
HTML and, in combination with a well-designed cascading style sheet
(CSS), these can often produce perfect results. Sometimes, though,
small pieces of the HTML code are messy, normally because authors have
not applied paragraph tags or styles correctly in the source document.


Why Perl?


Perl is a good language to use for this task because it is excellent
at processing text files, which, let's face it, is all HTML files are.
Perl is also the de facto standard for the use of regular expressions,
which you can use to search for, and replace or change, bits of text
or code in a file.
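
To give a flavour of searching and replacing, here is a small sketch that works on a single string rather than a file. The sample line and the two clean-up rules are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample line of the kind a converter might produce.
my $line = '<P>Hello&nbsp;world</P>';

# Search: does the line contain a non-breaking space entity?
my $found = ($line =~ /&nbsp;/) ? 1 : 0;

# Replace: copy the line, then swap every &nbsp; for a plain space...
(my $fixed = $line) =~ s/&nbsp;/ /g;
# ...and lowercase the opening and closing P tags.
$fixed =~ s{<(/?)P>}{<${1}p>}g;

print "$fixed\n";   # prints <p>Hello world</p>
```

The m// match operator (written here simply as /.../) does the searching, and the s/// substitution operator does the replacing.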


What is Perl?


Perl (often expanded as Practical Extraction and Report Language,
although the name is not officially an acronym) is a general purpose
programming language, which means it can be used for a very wide range
of programming tasks. Having said that, Perl is very good at doing
certain things, and not so good at others. Although you could do it,
you wouldn’t normally develop a user interface in Perl, as it would be
much easier to use a language like Visual Basic to do this. What Perl
is really good at is processing text. This makes it a great choice for
manipulating HTML files.


What is a Regular Expression?


A regular expression is a string that describes or matches a set of
strings, according to certain syntax rules. Regular expressions are
not unique to Perl - many languages, including JavaScript and PHP, can
use them - but Perl builds them into the core syntax of the language,
which makes them particularly convenient to use.
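
As a small illustration of one pattern describing a set of strings, the sketch below uses a single regular expression to pick out the opening heading tags <h1> to <h6>. The list of candidate strings is invented for this example.

```perl
use strict;
use warnings;

# One pattern for a whole set of strings: the opening tags <h1> to <h6>.
# The i flag makes the match case-insensitive.
my $heading = qr/^<h[1-6]>$/i;

# Invented test strings; only the first two belong to the set.
my @candidates = ('<h1>', '<H3>', '<h7>', '<p>');
my @matches    = grep { $_ =~ $heading } @candidates;

print "@matches\n";   # prints <h1> <H3>
```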


In part 2, we'll look at our first example Perl script.

About the Author: John Dixon is a freelance web developer and technical author.


Go to http://www.computernostalgia.net to read and submit articles and photos relating to the history of the computer


Go to http://www.dixondevelopment.co.uk to find out more about John's work

Article Source: http://EzineArticles.com/?expert=John_Dixon



Copyright 2007 by Perl Pages Forum