Sequence Data Analysis

Genome Assembly and Annotation

My thanks go to Dr. Rob Edwards who initially developed this module and whose work I have heavily utilized in developing this module.

So far in this class you’ve been exposed to some the most recent advances in genomics, proteomics and bioinformatics. In a sense, we are now taking a step back to what got it all started – DNA sequencing and the reconstruction of (almost) entire genomes.

In the last 30 years, DNA sequencing has evolved from a laborious process in which one was fortunate to produce a few thousand nucleotides of good quality data after a week of hard work, to one in which one simply sends out a plate of samples to a core facility and produces several thousand nucleotides of data in a few hours (or thousands of kilobases of a data at a large facility).

The dominant method of DNA sequencing involves labeling the DNA with four different fluorescent dyes, separating the DNA by electrophoresis, and measuring the intensity of the fluorescence to determine the base order. By the most common methods, this results in a “trace” file that records the wavelength of the DNA fragments as they pass the laser that excites the dye. Just about any DNA sequencing machine will infer the linear DNA sequence from the raw fluorescence data. There are a small variety of formats for storing and presenting this data, and the most popular one is the Standard Chromatogram File, or scf. This is the format we will be using. There are a variety of programs that will read these files and perform some “post-processing” of the data, often a re-inference of the linear DNA sequence and assignment of quality scores to the “call”, or assignment of nucleotide at each site.

For genome sequencing, a suite of tools written by researchers at U. of Washington, Seattle is most commonly used. These programs, Phred, Phrap,and Consed are the most widely used for sequence data analysis. Variants of this method have been incorporated into other computer packages. You are invited to visit the UW-Seattle site to learn more or download the programs. However, we will not be running these programs. Instead, we will use some data that has already been partially pre-processed – we will perform the final necessary processing to get the final usable data.

I have provided a compressed file with 265 trace files that are part of the whole genome sequence (>134,000 traces) of the fascinating bacterium Geobacter metallireducens. This data is from the trace file archive at NCBI, and the data was originally produced at the DOE Joint Genome Institute.

Sequencher

Sequencher is one of the more popular programs for single-lab and small genome sequencing projects. Eukaryotic genome projects and genome sequencing centers generally use more industrial-strength (and less user-friendly) packages of programs. Sequencher is available for mac or pc.

Using Sequencher

We are using the demo version of Sequencher. Purchased versions of Sequencher have a physical or network key that enable it to run in full mode.The demo version will do almost everything the full version will, except such things that are vital to performing research such as printing, saving, and exporting analyses.

Extract the file “Course_Seqs.zip” using the winzip utility in Windows XP (right-click and “Extract All”). This should produce a folder of 265 files, all of which end with .b1.scf, although the .scf part may not be displayed.

Next, launch Sequencher demo. Dismiss the notification that you are running a demo version. Go to File>Import>Folder of Sequences. Navigate to “Course_Seqs” folder. Select it and approve the importation of 265 files. You should have something that looks like this:

 

(note that I am showing the mac version of each window because the demo version of Sequencher interferes with copying screen shots to the clipboard)

 In the first column there is the name of the sequence and an icon that indicates that we are looking at a sequence – new icons will appear once we assemble the sequences into contigs.  The second column contains size of the sequence in bp. The third column gives an average quality value for the entire sequence as a percent.  The remaining columns give details about that specific sequence.

Click on the header for the Quality column. Your window should sort the sequences like so:

Click somewhere off of one of the sequences to deselect them. Now, double-click on the top sequence. You will see the text representation of the sequence with a background of different shades of blue-green. What does the coloring mean? Under the window menu, select User Preferences and then click on “Confidence” under the list of “General” preferences. You will see the confidence ranges and colors assigned to them.

The score is a number between 1 and 60, where the number represents the power to 10.  So a score of 20 means that the peak is incorrectly called one time in 100 (102) and a Phred score of 45 means that sequence is wrong one time in 31622.8 times (104.5).  In general a score above 20 means that the sequence is reliable. Close that window.

Chromatograms

Let’s look at the some of the underlying data. At the top of the window showing the text sequence, click the “Show Chromatogram” button. You will see something like this:

As you move along the X axis from line-to-line, this window shows the level of intensity of fluorescence of each of the four dyes. Each dye corresponds to a different nucleotide, as the legend to the left shows. During electrophoresis, the DNA sequence sample is moving past a detector that records the fluorescence – hence the data are continuous and you get these “flowing” peaks. Ideally, each peak should be discrete from the others and represent a distinct base. However, there is always overlap among peaks to some extent.

Note that in the text, the first several nucleotides are dark blue (low score). You can see why – we have peaks (dyes for different nucleotides) overlapping extensively. A little further into the sequence, the peaks overlap very little and the confidence scores are higher. Close the windows for this sequence and scroll to the last sequence listed (according to “Quality”). Double-click on it and view its chromatogram. You see that the peaks are often less distinct and there are many more “secondary peaks.” These are small peaks near the baseline that can complicate the ability of the machine to call the bases correctly. Confidence and Quality scores are heavily influenced by this. Scroll through the sequence to see the start, middle and end. You will note that no matter which sequence you examine, the early part of the sequence is usually of better quality than the end – this is mainly a consequence of the fact that it is more difficult to make the fine distinction between DNA strands that differ by one nucleotide in length as the overall length of the DNA strands increases. Additionally, the secondary peaks become much harder to distinguish from the primary peaks. (By the way, we won’t get into this, but in diploid organisms the presence of 2 peaks at the same location may indicate that the sample is from a heterozygote at that site)

If you go back to the sequence with the highest Quality, you see that after the first 20 called nucleotides, the peaks are fairly clean and one readily agrees with the machine calls. However, in the first 20 or so calls, there are probably some bases that should not have been called at all due to overlap in peaks or very little distinction from background. You can manually delete nucleotides, change the base that was called, or insert/remove indels (=insertion/deletions). Indels are inserted via the tab or shift-tab combination – this is more relevant once one is examining contigs.

In trying to manually edit sequences, it sometimes helps to change the height of the peaks in that region. This is done using the slider bar on the left:

Play with it to see how it influences how well you can perceive secondary peaks.

Sequencher has one other feature on this icon that helps to view secondary peaks. By clicking on the A, G, C, or T you can hide those traces. So the same view with only the T-trace showing you get an image like this:

Notice that all the other bases are shown in italics to emphasize that those traces are hidden. You can see that there probably should be a T at the first of the two N's.

These ambiguous peaks are sometimes called secondary peaks. Sequencher has a facility for calling the bases for secondary peaks. If you pull down the sequence menu and choose "Call secondary peaks..." you will see a similar dialog appear:

The slider allows you to vary the sensitivity of the secondary peaks, while the other checkboxes allow you to control which peaks are called, if a standard reference that you provide is used for comparison, and how much of the sequence is re-examined. Try this with some of the sequences, especially in a region that you think is ambiguous. Alternate between the chromatogram and text views. When secondary peaks get called, you will see that the altered bases are flagged in a different color (pink) and the letters may have changed. Here are examples:

 

The following table shows the IUPAC codes for unambiguous and ambiguous nucleotides. These codes are used mainly in two situations: 1) when the data quality does not permit calling of a base and 2) when more than one nucleotides really does exist at a site, for example in a heterozygote.

symbol

base

 

symbol

base

A

adenosine

 

M

A C (amino)

C

cytidine

 

S

G C (strong)

G

guanine

 

W

A T (weak)

T

thymidine

 

B

G T C

U

uridine

 

D

G A T

R

G A (purine)

 

H

A C T

Y

T C (pyrimidine)

 

V

G C A

K

G T (keto)

 

N

A G C T (any)

 

 

 

-

gap of indeterminate length

Close the chromatogram window, but leave open the text view of the sequence. Click the “Overview” button at the top of the window. You should see something like this:

As the legend indicates, this shows the start and stop codons in all three possible reading frames. It would also show other features (similar to those in a GenBank file), if they were defined. Click the “”Bases” button to get back to the text view. Now, click the “cut map” button to get the view below. This shows where various restriction enzymes could cut this sequence, as well as those that are not expected to cut it. This can be useful if one wants to insert this sequence into a vector, or wants to avoid cutting this region of the genome in a digest of genomic DNA. Return to the “Bases” view.

The Ruler button provides a ruler with a number of different functions as shown below:

There are 15 buttons on the ruler:

  1. single stranded DNA sequence
  2. double stranded DNA sequence
  3. single stranded DNA sequence with translation
  4. translation frame (1, 2, 3 or all)
  5. space every 3 nucleotides
  6. space every 5 nucleotides
  7. space every 10 nucleotides
  8. space every 20 nucleotides
  9. no spaces
  10. single space between lines
  11. double space between lines
  12. triple space between lines
  13. DNA in lower case
  14. DNA in upper case
  15. Change font

Playing around with the buttons, you could get a three frame translation of the sequence like this:

 

Assembling & Trimming Sequences

Assembling sequences – 1st run.

The 265 sequences we have are a very small subset of the >134,000 traces used to construct the entire genome sequence. Even at that, if we were to try to compare each sequence to the others visually, this would be an almost insurmountable task. This is where Sequencher does its most useful work. Look above the main window in Sequencher at the buttons:

We have the option of trying to piece together the sequences by various means, some based more on previous knowledge than others. We will blindly try to assemble the pieces into contiguous sequences (contigs). Click “Assemble Automatically.”

Sequencher compares each sequence to the others, considering each strand. On my pc 102,850 comparisons were made in 3 minutes 51 seconds to construct 81 contigs and leave 22 fragments unconnected to others. If we had tried to construct the entire genome within Sequencher (assuming memory were adequate), the process would probably take weeks. Sequencing centers who focus on whole genomes need considerable computing power – usually a cluster of fast CPUs.

You probably need to scroll down a bit in the main window to see the list of contigs. You will note the length column now shows the length of the alignments, not the individual sequences anymore. Double-click on one of them and you will see a display like this:

This shows where the sequences overlap, the orientation (arrows), and where start/stop codons are in the three possible reading frames (see the legend). Green arrows represent the “forward” strand (the one you input), while red arrows indicate the sequences had to be reverse-complemented.

Click on the “Bases” button at the top. You will get a view such as this (I scrolled into the alignment a bit):

Your results may differ somewhat from mine depending on how much you played with some of the earlier options we explored.

Click on one of the  letters in the bottom sequence – this is the consensus of the aligned sequences. This should highlight the same position in all of the aligned sequences. Note that dots represent places where a real difference appears to exist among the sequences and plus (+) signs indicate regions where disagreements involve some ambiguity in one sequence.

Find one of these regions of disagreement and click on that letter in the bottom sequence. Now click on the button “Show Chromatogram.”

In many cases, like this one, the disagreement mainly exists because one or more sequences is of poor quality in that region and bases could not be called with high accuracy. If you wanted to, you could edit the alignment within this view by highlighting the base you want to change and typing the new letter. That is what I have done here:

Well, that was fun. Now, get out of any sequence/contig specific windows you have opened and select all the sequences/contigs. Under the Contig menu, select “Dissolve contig.” You will be warned what you are about to do – go ahead and dissolve them.

Trimming Vector

We are going to trim the sequence using two different criteria. In actual practice, these two trimming steps would be done in reverse order or simultaneously but doing it this way will probably illustrate the points better.

This organism was sequenced via shot-gun sequencing in which the genome was cut or randomly sheared into small fragments that were then inserted into a cloning vector. Most cloning vectors are in some way derived from m13, pUC, or Bluescript. In the polycloning region of these vectors are recognition sites for several restriction enzymes. In the case of this organism, the genomic DNA was inserted into a SmaI site.

Close any chromatogram or text views of the sequences and then select all of the sequences (ctrl-A, or Select>Select All)

Now, under the Sequence menu, select “Trim Vector…” You’ll get a window like this, and you should click “Choose Insertion Site Now.”

This will bring up a window like this:

 

Before you can begin to trim the sequence, you need to select the vector in the Vecbase file. Click the top button for “Use Vecbase file.” Your options will be more restricted than shown here, but select BlueKSm.

Now you see the sequence of the polycloning (or, polylinker) region. Since SmaI was used, click on it.

The SmaI site should now be highlighted in green. Click OK.

Now, we don’t need to save any information on this, so just click the red and white “close” button in the upper right of the window. You should be back at the list of sequences. Now, again go to the Sequence menu and select “Trim Vector…” This time it will search for matches to the polylinker region sequence and SmaI site and flag those for trimming (=removal from further analysis).

Now we see that many sequences still retained some of the vector sequence because the primer used for sequencing was specific for the cloning vector, not the insert. Go ahead and click “Trim Checked Items.” When the warning comes up that this is undoable, click “Trim.” Then close the trimming vector window.

By the way, if things don’t seem to be working correctly, let me know. In the worst-case scenario, you can shut the program down and re-import the sequences in a new run of the program.

 

OK, let’s rebuild the contigs – do you recall how?

Hmm…  this time 104,464 comparisons were made, 75 contigs constructed and it took 2 minutes 23 seconds on my pc (mac results are shown here). Why do you think we got fewer contigs and it took less time?

You already know how to view and edit the contigs, so let’s move on to the next type of trimming that is typically done. Dissolve all the contigs.

Under the Sequence menu, select “Trim Ends” This will automatically start the process. You will see that relatively less sequence is trimmed this time – this is due to two factors: 1) much of the poorer quality sequence was removed with the vector sequence and 2) I already weeded out many of the truly bad sequences.

At the top of the “Ends Trimming” window, click the “Change Trim Criteria” button. You will see a window like this:

This allows you to fairly specifically determine how you want to automatically remove problem sequence areas. The defaults are usually good enough, but if you have a particular type of problem with your sequences (i.e., the peak intensity tends to drop off sharply after 500 bases, primer “blobs” obscure good sequence), you can partially address that here. Go ahead and get out of this window and “Trim Checked Items.”

Select all the sequences, and “Assemble Automatically.” On my pc, this took 1 minute 53 seconds for 104,964 comparisons to build 73 contigs. Peruse some of them. It varies from contig to contig, but in general there are fewer disagreements among the sequences and fewer places where several dots are stretched across extensive areas of disagreement.

Microbial Genome Annotation

Gene prediction and genome annotation has become a very complex field that we will not have time to adequately explore. Microbes, not having introns or extensive intergenic regions in general, are easier for gene prediction and annotation than are eukaryotes, but still quite a cottage industry has arisen in the development of software. We will explore the online version of one of the popular packages. SoftBerry sells a fully functional version of fgeneb, but we will use the more limited online version. Follow this link.

We will use the website to predict where genes are located and the predicted amino acid sequence. The file we will use is the first 20,000 nucleotides of the Geobacter metallireducens genome sequence. This is the fasta file we will use: “Gmetallireducens1to20000.fasta”. It should be located on your computer. At the SoftBerry fgenesb site, you should see these controls:

Open the fasta file in Word and paste the sequence into the textbox (I sometimes have problems with browsing to the file and uploading it). In the pull-down menu for selecting the closest organism, select Geobacter sulfurreducens PCA”, as shown. Click on process.

Your results should look similar to this:

You may also get an assignment of genes to operons (3 in this case, although this run did not produce that information).

This shows that 14 genes were predicted in this region. Predictions 2 and 8 are unusually short proteins and are probably wrong (no experimental evidence has been found that they are expressed as proteins). The file “Gmetallireducens1to20000.gb” shows the GenBank annotation of this region for your comparison. Not all of the start and stop positions agree between the prediction above and the GenBank file. However, in many cases the information in the GenBank file is based solely on computational inference and may be wrong. The problem of overprediction and correct identification of start and stop codon is a widespread one in genome annotation.

In practice, genome sequencing centers and research groups use automated genome annotation methods (including the commercial version of fgenesb). However, this can be done simply by use of the above information and BLAST searches of the NCBI database. Try doing BLAST searches with some of the predictions to see if you get a hit to the G. metallireducens genome, to proteins in other species with similar names, or to completely unrelated species and proteins.