Exercise Set: Pattern Matching in Perl
Try the following scripts on nucleotide search results
such as hivtatresult.txt.
- pubmeds.pl
prints out all PUBMED lines.
The condition ( $line =~ /PUBMED/ ) holds
if the line contains the word PUBMED.
- Suppose we want to extract the actual PUBMED numbers.
The condition ( $line =~ /^\s*PUBMED\s+\d+/ ) means
the line should start (^) with zero or more
whitespace characters (indicated by a * after \s),
followed by the string PUBMED,
followed by one or more (indicated by +) whitespace characters,
followed by one or more digits.
To extract the digits, place parentheses
around \d+, giving (\d+). Whenever there is
a successful match, $1 holds the characters
matched by the part of the pattern within the parentheses.
The script pubmednums.pl
uses this pattern-matching technique to
extract PUBMED numbers.
- version.pl
grabs version reference numbers from the RNA text file.
Multiple pairs of parentheses can be used to extract
several parts from the string ($1, $2, $3, ... stand
for the parts of interest, in order).
The script uses a last; statement.
This causes control to break out of the loop, which is appropriate
in this case because we assume there is only one
VERSION line, so there is no need to read the
succeeding lines.
- gettrans.pl
extracts the translation string from the file.
Notice that the string may span multiple lines,
so the script must detect the line that contains
the string /translation=" and then read the
following lines until a line containing " is read.
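The capture and last; techniques described above can be sketched as follows.
This is a minimal illustration, not the contents of pubmednums.pl or version.pl.
The sample lines, the subroutine names pubmed_numbers and first_version,
and the assumed VERSION layout (an accession, a dot, then a version number)
are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample lines in a GenBank-like layout (assumption:
# these are made up for illustration, not taken from hivtatresult.txt).
my @lines = (
    "VERSION     AB000001.1  GI:1234567",
    "  PUBMED   9000001",
    "  TITLE     A title that mentions PUBMED in passing",
    "   PUBMED   9000002",
);

# Collect the digits captured by (\d+) on every line that starts
# with optional whitespace followed by PUBMED.
sub pubmed_numbers {
    my @found;
    for my $line (@_) {
        if ( $line =~ /^\s*PUBMED\s+(\d+)/ ) {
            push @found, $1;    # $1 is the text matched inside the parentheses
        }
    }
    return @found;
}

# Return the two captured parts of the first VERSION line.
# Assumed layout: VERSION, whitespace, accession, a dot, version number.
sub first_version {
    my @parts;
    for my $line (@_) {
        if ( $line =~ /^VERSION\s+(\w+)\.(\d+)/ ) {
            @parts = ( $1, $2 );    # two pairs of parentheses give $1 and $2
            last;                   # only one VERSION line is expected, so stop here
        }
    }
    return @parts;
}

print join( ",", pubmed_numbers(@lines) ), "\n";
print join( " ", first_version(@lines) ), "\n";
```

On the sample lines this prints 9000001,9000002 and then AB000001 1.
The TITLE line is skipped even though it contains the word PUBMED,
because the pattern is anchored with ^ and allows only whitespace before PUBMED.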
Exercises:
- (Easy) Write a Perl script that prints out the accession code
for the file (the file has a single line with the word ACCESSION).
- (Moderate) gettrans.pl
has a bug.
If the translation string occurs on a single line
(e.g., /translation="MGLSDGEWQLVLNVWGKVEADIP"), the script will not work.
Correct the Perl script so it works even for this case.
- (Difficult) Write a Perl script that lists all authors
indicated in the file.
Author information consists of all the words (possibly
on separate lines) that occur after the keyword
AUTHORS and before the keyword TITLE.
Note that there may be several occurrences of the keywords
AUTHORS and TITLE in the file.