Bioinformatics program

This problem involves lists, Strings, files, and object-oriented design, and asks you to create a modular program from scratch using stepwise refinement. It is based on BLAST (Basic Local Alignment Search Tool), an important tool that biologists worldwide use to compare DNA and protein sequences in order to answer questions such as: 1) How do we decide whether two strings of DNA characters represent the same gene or different genes? 2) If we identify a gene responsible for some particular disease or function, how do we determine whether related genes appear in other species? 3) How do we decide whether two sequences are similar enough to suggest that the organisms descended from a common ancestor? [1]

Read in from the user: 1) a query String (which must consist of characters A, C, T, G representing a sequence of nucleotides), 2) the name of an input file containing a potentially long sequence of nucleotides (characters A, C, T, G) representing the DNA of some organism, and 3) a threshold value that defines a successful match (as described below).
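One possible way to gather the three inputs is sketched below. The class name DnaReader, the helper names prompt and readDna, and the choice to strip line breaks from the input file are assumptions made for illustration; the assignment does not say whether the file spans multiple lines.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

final class DnaReader {
    /** Reads the whole file and joins its lines into one sequence (assumes line breaks are not part of the DNA). */
    static String readDna(String fileName) throws IOException {
        StringBuilder dna = new StringBuilder();
        for (String line : Files.readAllLines(Paths.get(fileName))) {
            dna.append(line.trim());
        }
        return dna.toString();
    }

    /** Prints a label and returns the next input line, or "" if input has ended. */
    static String prompt(Scanner in, String label) {
        System.out.print(label);
        return in.hasNextLine() ? in.nextLine().trim() : "";
    }

    public static void main(String[] args) throws IOException {
        Scanner in = new Scanner(System.in);
        String query = prompt(in, "Query String: ");
        String fileName = prompt(in, "DNA file name: ");
        String thresholdText = prompt(in, "Threshold: ");
        if (query.isEmpty() || fileName.isEmpty() || thresholdText.isEmpty()) {
            System.out.println("Missing input.");
            return;
        }
        int threshold = Integer.parseInt(thresholdText); // accepts "+1" as well as "1"
        String dna = readDna(fileName);
        System.out.println("Read " + dna.length() + " nucleotides; threshold " + threshold);
    }
}
```

Keeping file reading in its own method makes it easy to replace later with the binary-file option.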

The query String should be validated and its length (n) computed. Store the DNA sequence in an ArrayList, where the element at position i contains a word of length n starting at position i of the DNA sequence. For example, if the DNA sequence in the input file is "ACTGGCCTA" and the query String is "CTG" (a String of length 3), then the list will look like this: <"ACT", "CTG", "TGG", "GGC", "GCC", "CCT", "CTA"> representing all of the n-letter sequences in the DNA file.
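A minimal sketch of the validation and list-building steps follows. The class and method names (WordList, isValidQuery, buildWords) are hypothetical, and the sketch assumes that only uppercase letters are valid.

```java
import java.util.ArrayList;

final class WordList {
    /** Returns true if the query is non-empty and contains only A, C, G, T (uppercase assumed). */
    static boolean isValidQuery(String query) {
        return query != null && query.matches("[ACGT]+");
    }

    /** Builds the list whose element i is the n-letter word starting at position i of the DNA sequence. */
    static ArrayList<String> buildWords(String dna, int n) {
        ArrayList<String> words = new ArrayList<>();
        for (int i = 0; i + n <= dna.length(); i++) {
            words.add(dna.substring(i, i + n));
        }
        return words;
    }

    public static void main(String[] args) {
        String query = "CTG";
        if (!isValidQuery(query)) {
            throw new IllegalArgumentException("Query must contain only A, C, T, G");
        }
        // prints [ACT, CTG, TGG, GGC, GCC, CCT, CTA]
        System.out.println(buildWords("ACTGGCCTA", query.length()));
    }
}
```

A DNA sequence of length L yields L - n + 1 words, so a sequence shorter than the query produces an empty list.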

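Compare the query String with each substring in the list, computing a score for each comparison. Assign +1 to each pair of identical characters (in the same position) and -1 to each pair of mismatched characters; the score is the sum over all positions. Assuming the above example, the scores would be as follows.

DNA substring  Position  Score
-------------  --------  -----
ACT                0       -3
CTG                1       +3
TGG                2       -1
GGC                3       -3
GCC                4       -3
CCT                5       -1
CTA                6       +1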
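The scoring rule can be sketched as a single method. The names Scorer and score are hypothetical, and the length check is an added safeguard, not something the assignment requires.

```java
final class Scorer {
    /** Adds +1 for each position where the characters match and -1 where they differ. */
    static int score(String query, String word) {
        if (query.length() != word.length()) {
            throw new IllegalArgumentException("Query and word must have the same length");
        }
        int score = 0;
        for (int i = 0; i < query.length(); i++) {
            score += (query.charAt(i) == word.charAt(i)) ? 1 : -1;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("CTG", "CTA")); // prints 1 (two matches, one mismatch)
    }
}
```

For a query of length n, the score ranges from -n (no characters match) to +n (an exact match).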

Report the results: list all matches whose score meets or exceeds the user-specified threshold, sorted from highest to lowest score. Also indicate the position of each matched substring within the DNA sequence, counting from 0 as in the list above. Assuming the above example, here is a sample output:

Query String: CTG
DNA file name: ecoli.txt
Threshold: +1

DNA substring  Position  Score
-------------  --------  -----
CTG                1       +3
CTA                6       +1
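The filtering, sorting, and reporting steps might look like the sketch below. The class MatchReport, its nested Match class, and the printf column widths are assumptions for illustration; the exact spacing of the sample output is not reproduced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MatchReport {
    /** One match: the word, its 0-based position in the DNA sequence, and its score. */
    static final class Match {
        final String word;
        final int position;
        final int score;

        Match(String word, int position, int score) {
            this.word = word;
            this.position = position;
            this.score = score;
        }
    }

    /** +1 per matching position, -1 per mismatch (words assumed to have equal length). */
    static int score(String query, String word) {
        int score = 0;
        for (int i = 0; i < query.length(); i++) {
            score += (query.charAt(i) == word.charAt(i)) ? 1 : -1;
        }
        return score;
    }

    /** Keeps words scoring at or above the threshold, sorted from highest to lowest score. */
    static List<Match> findMatches(List<String> words, String query, int threshold) {
        List<Match> matches = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            int s = score(query, words.get(i));
            if (s >= threshold) {
                matches.add(new Match(words.get(i), i, s));
            }
        }
        // List.sort is stable, so equal scores stay in position order.
        matches.sort(Comparator.comparingInt((Match m) -> m.score).reversed());
        return matches;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ACT", "CTG", "TGG", "GGC", "GCC", "CCT", "CTA");
        System.out.println("DNA substring  Position  Score");
        System.out.println("-------------  --------  -----");
        for (Match m : findMatches(words, "CTG", 1)) {
            // %+d prints an explicit sign, as in the sample output.
            System.out.printf("%-13s  %8d  %+5d%n", m.word, m.position, m.score);
        }
    }
}
```

Storing the position alongside each score matters because sorting by score would otherwise lose track of where each word came from.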

Write the list of DNA substrings to a binary file. The next time the program runs, give the user the option of reading the information from this file directly into the list rather than rebuilding it from the ASCII text file. This feature is especially useful if the DNA sequence is very long and the user wants to run the program multiple times on the same DNA sequence. Refer to pages 289-290 of the third edition (or pages 269-270 of the second edition or pages 201-202 of the updated edition) to see how to write an object to a file and how to read an object from a file.
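As a rough illustration, the list can be saved and restored with standard Java object serialization. This sketch is not taken from the textbook pages cited above and may differ from the approach shown there; the class name WordListStore and the file name words.bin are hypothetical.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;

final class WordListStore {
    /** Writes the whole list to a binary file as a single serialized object. */
    static void save(ArrayList<String> words, String fileName) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName))) {
            out.writeObject(words);
        }
    }

    /** Reads the list back; the cast is unchecked because the file stores a plain Object. */
    @SuppressWarnings("unchecked")
    static ArrayList<String> load(String fileName) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName))) {
            return (ArrayList<String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> words = new ArrayList<>(Arrays.asList("ACT", "CTG", "TGG"));
        save(words, "words.bin");              // hypothetical file name
        System.out.println(load("words.bin")); // prints [ACT, CTG, TGG]
    }
}
```

Both ArrayList and String implement Serializable, so no extra work is needed to store the list. Note that the saved words all have the length of the query used to build them, so a saved file is only reusable for queries of that same length.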

References

[1] Cutter, Pamela. "Having a BLAST: A Bioinformatics Project in CS2," SIGCSE 2007.
[2] http://en.wikipedia.org/wiki/BLAST
[3] http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
[4] http://130.14.29.110/BLAST/