The k-mer graph is a topological representation of the sequencing data from which reads can be reconstructed by assembling consecutive k-mers. Ultra-long reads are often parsed by a read-mapping program that generates a read k-mer graph in which every node is a read and every edge is a k-mer that occurs in the read. However, the read k-mer graph is frequently incomplete. For example, a read may not contain k-mers long enough to allow reconstruction, in which case an “empty” read may be present in the graph.
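As an illustration of the read k-mer graph described above, the following minimal sketch links two reads whenever they share at least one k-mer and reports reads that are too short to yield any k-mer as “empty”. The function names (`kmers`, `read_kmer_graph`), the edge representation, and the toy reads are assumptions made for this example and are not taken from the pipeline used here.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty list if len(seq) < k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def read_kmer_graph(reads, k):
    """Build a toy read k-mer graph.

    reads: dict mapping read name -> sequence.
    Returns (edges, empty), where edges maps a sorted pair of read names
    to the set of k-mers they share, and empty lists reads with no k-mers.
    """
    index = defaultdict(set)  # k-mer -> names of reads containing it
    empty = []
    for name, seq in reads.items():
        read_kmers = kmers(seq, k)
        if not read_kmers:
            empty.append(name)  # read shorter than k: an "empty" node
        for kmer in read_kmers:
            index[kmer].add(name)

    edges = defaultdict(set)
    for kmer, names in index.items():
        for a, b in combinations(sorted(names), 2):
            edges[(a, b)].add(kmer)
    return dict(edges), empty


# Hypothetical toy reads; "r3" is shorter than k and therefore empty.
toy_reads = {"r1": "ACGTAC", "r2": "GTACGG", "r3": "AC"}
edges, empty = read_kmer_graph(toy_reads, k=4)
print(edges)  # {('r1', 'r2'): {'GTAC'}}
print(empty)  # ['r3']
```

Indexing k-mers first and then pairing the reads that share each k-mer avoids comparing every pair of reads directly, which matters once the number of reads grows.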
We modified the Illumina-based graph processing algorithm, Burrows-Wheeler graph (BWGVG), to construct a graph by connecting each read to its immediate k-mer neighbors that occur within a maximum window size for that read.
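The window-restricted linking step can be sketched as follows. This is not the BWGVG implementation; it is a simplified stand-in that assumes one possible reading of the window, namely that two reads are neighbors when they share a k-mer whose start offsets in the two reads differ by at most `window` bases. The function name `windowed_neighbors` and the toy reads are hypothetical.

```python
from collections import defaultdict


def windowed_neighbors(reads, k, window):
    """Link reads that share a k-mer at offsets differing by <= window.

    reads: dict mapping read name -> sequence.
    Returns a dict mapping each read name to a sorted list of neighbors.
    The offset-difference rule is an assumption made for illustration.
    """
    positions = defaultdict(list)  # k-mer -> [(read name, start offset)]
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]].append((name, i))

    neighbors = defaultdict(set)
    for hits in positions.values():
        for a, i in hits:
            for b, j in hits:
                if a != b and abs(i - j) <= window:
                    neighbors[a].add(b)
    return {name: sorted(neighbors[name]) for name in reads}


# Hypothetical toy reads sharing the k-mer GTAC at offsets 2, 0 and 2.
toy_reads = {"r1": "ACGTAC", "r2": "GTACGG", "r3": "TTGTAC"}
print(windowed_neighbors(toy_reads, k=4, window=0))
# {'r1': ['r3'], 'r2': [], 'r3': ['r1']}
print(windowed_neighbors(toy_reads, k=4, window=2))
# {'r1': ['r2', 'r3'], 'r2': ['r1', 'r3'], 'r3': ['r1', 'r2']}
```

Under this assumption, a smaller window yields a sparser graph, because shared k-mers at distant offsets no longer create edges.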
MAPQ replicates were mixed by pipetting up and down ten times, then centrifuged for 10 min at 800 × g. The 1.5 ml volume was placed on a new 0.45 μm filter and washed with an additional 2 ml of wash buffer. The filter was discarded after the first run. The entire sample was run in the same manner at least twice to verify reproducibility. This protocol was run on the sequencer to confirm that ligation and size selection worked, and on a well-characterized Illumina lane to verify reproducibility. It has been optimized for the nanopore sequencer to yield on average 24.3 M/lane/run, which is 10-fold higher than the current generation of MiSeq devices (1.1-1.3 M/lane/run).
The data sets from this paper are available from the NCBI Sequence Read Archive. Barcode sequences have been removed for the work described in this manuscript, and work is in progress to automate processing of the sequencing data in order to make it publicly available. NCBI accession numbers are currently unavailable.
The difference between the GM12878 and GM12891 haplotypes can be detected by aligning reads from the two haplotypes to each other. The matched haplotype regions were segmented using a hidden Markov model (HMM). HMM calling was performed using the HMM function of the MEME tool 37 with default parameters. The emission probabilities (the relative number of reads of one haplotype versus the other) and the initial probabilities were preset for the HMM calls. One haplotype was chosen as the reference and the other was assigned to the transition probabilities to the first state.
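To make the segmentation step concrete, the sketch below decodes a two-state HMM with a generic Viterbi routine written in plain Python. It is not the HMM function of the tool cited above. The states (0 for one haplotype, 1 for the other), the binary per-bin observations (which haplotype has more reads in a bin), and all probability values are assumptions chosen for illustration only.

```python
import math


def viterbi(obs, init, trans, emit):
    """Return the most probable state path for a discrete HMM.

    obs:   list of observation indices.
    init:  init[s] is the initial probability of state s.
    trans: trans[r][s] is the probability of moving from state r to s.
    emit:  emit[s][o] is the probability of observing o in state s.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(init)
    score = [log(init[s]) + log(emit[s][obs[0]]) for s in range(n_states)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = score
        ptr, score = [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + log(trans[best][s]) + log(emit[s][o]))
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Hypothetical preset parameters and per-bin observations.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]
obs = [0, 0, 1, 0, 0, 1, 1, 1, 1]
print(viterbi(obs, init, trans, emit))  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

With these values, the isolated observation of haplotype 1 in the third bin is absorbed into the surrounding haplotype 0 segment, because two extra state switches cost more than one unlikely emission. This smoothing is what lets an HMM return contiguous segments rather than bin-by-bin calls.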
The haplotype sequences were recovered using minimap2 37, with the maximum allowed i value of 15. To evaluate the quality of the HMM calls, we compared them to haplotype calls made from ONT long reads that directly span the segments. To evaluate the ability of ONT long reads to bridge the genotype differences between the haplotypes, we aligned the reads to the HMM calls and compared the HMM-based and long-read-based genotype calls. We identified 15 HMM segments (six haplotypes) that were not supported by ONT long reads (Fig. 7a and Supplementary Table 14). These segments were highly concordant among all three haplotype sequences (exemplary alignment, Supplementary Fig. 6b). The ONT long reads could not span the most genetically diverse and complex segment, which involves, for example, HLA-DRB3 and HLA-DRB4 and which also contained several gaps in our contig assembly. We showed that short reads of the same haplotype align well to the haplotype-specific HMM calls (Supplementary Fig. 6a and 6b), and that the same HMM calling approach correctly recovered the haplotypes from PacBio reads (Fig. 7b).
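The check for long-read support of an HMM segment can be illustrated with a small interval test. The sketch assumes that a segment counts as supported when at least one aligned read covers it from start to end on the same contig. The function name `unsupported_segments`, the half-open coordinate convention, and the example coordinates are hypothetical and do not come from the analysis reported here.

```python
def unsupported_segments(segments, reads):
    """Return the segments that no single read spans completely.

    segments, reads: lists of (start, end) half-open intervals on the
    same contig. A read spans a segment if it starts at or before the
    segment start and ends at or after the segment end.
    """
    return [
        (seg_start, seg_end)
        for seg_start, seg_end in segments
        if not any(read_start <= seg_start and seg_end <= read_end
                   for read_start, read_end in reads)
    ]


# Hypothetical coordinates: the middle segment is covered only piecewise.
segments = [(100, 200), (500, 900), (1000, 1100)]
reads = [(50, 250), (450, 700), (650, 950), (990, 1200)]
print(unsupported_segments(segments, reads))  # [(500, 900)]
```

Requiring a single spanning read is stricter than requiring full coverage by overlapping reads, which matches the idea of reads that directly span a segment.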