How to reduce the impact of your PAF on your disk by 95%

Last week Shaun Jackman posted this tweet:

I have a 1.2 TB PAF.gz file of minimap2 all-vs-all alignments of 18 flowcells of Oxford Nanopore reads. Yipes. I believe that's my first file to exceed a terabyte. Is there a better way? Perhaps removing the subsumed reads before writing the all-vs-all alignments to disk?

— Shaun Jackman (@sjackman) 4 octobre 2018

For people who did not work on long-read assembly, for correction or assembly graph construction, we need to map the reads against each other. Minimap2 is a very good mapper used to find similar regions between long reads. Its output are PAF files (Pairwise Alignment Format) and are summarized on minimap2 man page. Roughly, it is a tsv file which stores, for each similar region found, (called before match): reads names, reads length, begin and end positions of match, plus some other information.

This tweet prompted a discussion, and a third solution was proposed:

Use mapping reference file

Torsten Seemann and I suggest using special a flag to get minimap2 output in SAM and compress it in BAM/CRAM format. But apparently it isn’t working well.

My trouble to convert bam in cram is solved thank to @RayanChikhi !

minimap output without sequence compress in cram,noref (i.e. tmp_short.cram) is little bit heavier than classic paf compress in gz.

So it's probably time for a Binary Alignment Format. pic.twitter.com/W02wj7tf2I

— Pierre Marijon (@pierre_marijon) 4 octobre 2018

OK even if I have removed unecessary field in SAM format (sequence and mapping field), and with better compression solution, it isn’t better than PAF format compression with gzip. Maybe with a larger dataset, a CRAM file could be better than a gzipped PAF file, but I am not convinced by this solution.

Filter

Heng Li said:

Minimap2 reports hits, not only overlaps. The great majority of hits are non-informative to asm. Hits involving repeats particularly hurt. For asm, there are ways to reduce them, but that needs evaluation. I won't go into that route soon because … 1/2

— Heng Li (@lh3lh3) 4 octobre 2018

Minimap didn’t make any assumption about what the user wants to do with read matching, and it is a very good thing, but sometimes you store too much information for your usage. So filtering overlaps before storing them on your disk could by a good idea.

Awk, Bash, Python, {choose your language} script could do this job perfectly.

Alex Di Genov suggest using minimap2 API to build a special minimap with integrating filters. This solution has probably better performance than ad hoc script but it’s less flexible, can’t be applied to other mapper.

My solution

It’s little soft in rust fpa for Filter Pairwise Alignment, they takes as input pairwise align in the PAF or MHAP, and they can filter match by:

fpa is available in bioconda and in cargo, you can found more information on github, and you can use it in your pipeline like gzip :

minimap2 -x ava-ont reads.fa reads.fa | fpa -d --compression-output gzip > only_dovetails_overlaps.paf.gz

OK filtering matches is easy and we have many available solution.

A Binary Alignment Format

Is it time for a Binary Alignment Format that uses integer compression techniques?

— Rayan Chikhi (@RayanChikhi) 4 octobre 2018

This is the question they initiate this blog post.

jPAF is only a POC I create them to test things on how to compress this type of data. I introduce to you now because the results already seem very interesting. However, there is still a lot of work to do, before release a good binary pairwise alignment format.

jPAF is a json file that contains the same information as a PAF file, but it is reorganized to save space, so it isn’t really a binary.

We have 3 main objects in this json:

A little example is probably better:

PAF version :

1_2    5891    1642    5889    +    2_3    4490    1    4248    4247    4247    0    tp:A:S

jPAF version:

{
   "header_index":{
      "0":"read_a",
      "1":"begin_a",
      … # index associate to the paf header name
   },
   "read_index":[
      {
         "name":"1_2",
         "len":5891
      },
      {
         "name":"2_3",
         "len":4490
      },
   ],
   "match":[
      [
         0,     # index of read in read_index table
         1642,  # begin position for read A
         5889,  # end position of read A
         "+",   # read have same strand
         1,	# index of read in read_index table
         1,	# begin position of read B
         4248,	# end position of read B
         4247,	# the other fields of the paf format
         4247,
         0,
         "tp:A:S"
      ],
   ]
}

In this example we didn’t save space but I demonstrate in result section how jPAF performance grow up with size of PAF. Here we have two reads 1_2 and 2_3, with 5891 and 4490 bases respectively (store in read_index object), and one overlap with length 4247 bases in the same strand between them (store in match object).

jPAF are fully inspired by PAF, and have same number of fields, and same field names. I just take the PAF, convert it in json and add two little tricks to save space.

OK, I have a pretty cool format, to avoid some repetition without loss of information, but do I really save space?

Result

Dataset: For this, I reuse the same dataset as my previous blog post. It is composed ot two real E. coli datasets: pacbio and nanopore.

Mapping: I run minimap2 mapping with preset ava-pb and ava-ont for pacbio and nanopore respectively.

If you want to replicate these results, just follow the instructions avaible at github repro. Full data are avaible here: nanopore, pacbio.

basic.sam designates the minimap2 output in sam format, short.sam designates the minimap2 output with then SEQ, QUAL fields replaced by a ‘*’.

PAF against jPAF

Nanopore:

  paf paf.gz paf.bz2 paf.xz
jpaf 64.65 %      
jpaf.gz 94.73 % 64.13 % 66.11 % 22.01 %
jpaf.bz2 94.42 % 62.01 % 64.11 % 17.41 %
jpaf.xz 95.84 % 71.70 % 73.26 % 38.47 %

Pacbio:

  paf paf.gz paf.bz2 paf.xz
jpaf 71.04 %      
jpaf.gz 92.57 % 50.36 % 34.08 % 28.79 %
jpaf.bz2 93.78 % 58.43 % 44.79 % 40.36 %
jpaf.xz 93.73 % 58.14 % 44.41 % 39.95 %

If I compare PAF against jPAF compressed with lzma I win 95.84%. I have a justification for my title, it’s 99% when I removed same read, containment, internal, less than 500 bp matches with fpa.

It’s less impressive but more accurate and realistic. At the same compression level, I earn between 71.04% and 38.47%. We can notice a decrease of the efficacity of jPAF against PAF when the compression algorithm becomes better.

Impact of size of input PAF on jPAF compression ratio