I recently discovered that GenBank holds additional data for many
sequences. (link)
http://www.ncbi.nlm.nih.gov/Traces/t...CODE%3D'RT-PCR'
When a flu genome is sequenced, the result is ~200 partial subsequences
of ~500-800 nucleotides each, starting at "random" positions in the genome.
The genome is assembled from these subsequences; they usually overlap,
so every position is covered multiple times.
These sets of subsequences are also available for ~9000 flu genomes
(7000 via ftp).
I took only the ~600 sets from wild birds and filtered out those that for
some reason didn't work with my programs, which left 435. For these I
calculated the alignments, the average number of subsequences that cover
each position (green in the pic), and the probability that the finally
assigned nucleotide (I don't know how they assign that value) differs from
the average (black in the pic).
I noticed that the region in H2 was covered more often than the others;
maybe subsequences preferentially break and start in that region.
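
The two per-position values described above could be computed roughly like
this. It is only a sketch, not the programs actually used for the picture:
it assumes each subsequence is already aligned to the assembled genome
without gaps and is given as a (start, sequence) pair with a 0-based start,
and it reads "the average" as the most common nucleotide among the covering
subsequences. The function name and input format are made up for
illustration.

```python
from collections import Counter

def coverage_and_disagreement(assigned, reads):
    """Per-position coverage and majority-vs-assigned disagreement.

    assigned: the finally assigned genome sequence (string).
    reads: list of (start, sequence) pairs; start is 0-based and the
           alignment is assumed to be gap-free (an assumption of this sketch).
    """
    n = len(assigned)
    # One counter of observed nucleotides per genome position.
    columns = [Counter() for _ in range(n)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < n:
                columns[pos][base] += 1

    # Coverage: how many subsequences cover each position.
    coverage = [sum(c.values()) for c in columns]
    # Disagreement: the most common nucleotide differs from the assigned one.
    # Uncovered positions count as "no disagreement"; ties go to the
    # nucleotide that was seen first.
    disagrees = [
        bool(c) and c.most_common(1)[0][0] != a
        for c, a in zip(columns, assigned)
    ]
    return coverage, disagrees
```

Averaging `coverage` and `disagrees` position by position over all sets
would then give curves comparable to the green and black ones, provided the
genomes share a common coordinate system.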
======================================