Exercise Set: Pattern Matching in Perl

Try the following scripts on nucleotide search results such as hivtatresult.txt.
  1. pubmeds.pl prints out all PUBMED lines. The condition ( $line =~ /PUBMED/ ) holds if the line contains the string PUBMED anywhere in it.
  2. Suppose we are interested in extracting the actual PUBMED numbers. The condition ( $line =~ /^\s*PUBMED\s+\d+/ ) means the line should start (^) with zero or more whitespace characters (indicated by a * after \s), followed by the string PUBMED, followed by one or more (indicated by +) whitespace characters, followed by one or more digits. To extract the digits, place parentheses around \d+ so that it reads (\d+); whenever there is a successful match, $1 holds the characters matched by the part of the pattern within the parentheses. The script pubmednums.pl uses this pattern-matching technique to extract PUBMED numbers.
  3. version.pl grabs the version reference numbers from the RNA text file. Multiple pairs of parentheses can be used to extract several parts of the string; $1, $2, $3, ... stand for the captured parts, numbered by the order of their opening parentheses. The script also uses a last; statement, which causes control to break out of the loop. That is appropriate here because we assume there is only one VERSION line, so there is no need to read the lines that follow.
  4. gettrans.pl extracts the translation string from the file. The string may span multiple lines, so the script detects the line that contains /translation=" and then keeps reading subsequent lines until it reads one containing the closing ".
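
The capture techniques from items 2 and 3 can be sketched as follows. This is only an illustration, not the actual pubmednums.pl or version.pl: the subroutine names pubmed_numbers and version_parts are hypothetical, the sample lines are made up, and the VERSION pattern assumes a line of the form "VERSION  <accession.version>  GI:<digits>", which may not match every file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the digits from every PUBMED line.
sub pubmed_numbers {
    my @lines = @_;
    my @ids;
    for my $line (@lines) {
        # ^\s*   optional leading whitespace
        # PUBMED the literal keyword
        # \s+    one or more whitespace characters
        # (\d+)  one or more digits, captured into $1
        if ($line =~ /^\s*PUBMED\s+(\d+)/) {
            push @ids, $1;
        }
    }
    return @ids;
}

# Hypothetical helper: return the two captured parts of the first
# VERSION line. The "GI:" layout is an assumption about the file format.
sub version_parts {
    my @lines = @_;
    my @parts;
    for my $line (@lines) {
        if ($line =~ /^VERSION\s+(\S+)\s+GI:(\d+)/) {
            @parts = ($1, $2);   # $1 and $2 follow the order of the parentheses
            last;                # only one VERSION line is expected, so stop here
        }
    }
    return @parts;
}

# Made-up sample lines, for illustration only.
my @sample = (
    "VERSION     AB000001.1  GI:1234567\n",
    "   PUBMED   1111111\n",
    "  TITLE     A PUBMED-like word inside a title\n",
    "   PUBMED   2222222\n",
);

print join(",", pubmed_numbers(@sample)), "\n";
print join(" ", version_parts(@sample)), "\n";
```

Because the PUBMED pattern is anchored with ^, the TITLE line above is skipped even though it contains the string PUBMED; the unanchored /PUBMED/ test from item 1 would have matched it.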
Exercises:
  1. (Easy) Write a Perl script that prints out the accession code for the file (the file has a single line with the word ACCESSION).
  2. (Moderate) gettrans.pl has a bug. If the translation string occurs in one line (e.g., /translation="MGLSDGEWQLVLNVWGKVEADIP"), it will not work. Correct the Perl script so it works even for this case.
  3. (Difficult) Write a Perl script that lists all authors indicated in the file. Author information consists of all the words (possibly spread over several lines) that occur after the keyword AUTHORS and before the keyword TITLE. Note that there may be several occurrences of the keywords AUTHORS and TITLE in the file.