How to reduce the impact of your PAF file on your disk by 95%

9 minute read 11 October 2018
Last week Shaun Jackman posted this tweet:

For people who did not work on long-read assembly, for correction or assembly graph construction, we need to map the reads against each other. Minimap2 is a very good mapper used to find similar regions between long reads. Its output are PAF files (Pairwise Alignment Format) and are summarized on minimap2 man page. Roughly, it is a tsv file which stores, for each similar region found, (called before match): reads names, reads length, begin and end positions of match, plus some other information.

This tweet prompted a discussion, and a third solution was proposed:

  • using classical mapping against reference compression format
  • filter some matches
  • the creation of a new binary compressed format to store all-vs-all mapping

Use mapping reference file

Torsten Seemann and I suggest using special a flag to get minimap2 output in SAM and compress it in BAM/CRAM format. But apparently it isn't working well.

OK even if I have removed unecessary field in SAM format (sequence and mapping field), and with better compression solution, it isn't better than PAF format compression with gzip. Maybe with a larger dataset, a CRAM file could be better than a gzipped PAF file, but I am not convinced by this solution.

Filter

Heng Li said:

Minimap didn't make any assumption about what the user wants to do with read matching, and it is a very good thing, but sometimes you store too much information for your usage. So filtering overlaps before storing them on your disk could by a good idea.

Awk, Bash, Python, {choose your language} script could do this job perfectly.

Alex Di Genov suggest using minimap2 API to build a special minimap with integrating filters. This solution has probably better performance than ad hoc script but it's less flexible, can't be applied to other mapper.

My solution

It's little soft in rust fpa for Filter Pairwise Alignment, they takes as input pairwise align in the PAF or MHAP, and they can filter match by:

  • type: containment, internal-match, dovetail
  • length: match is upper or lower than a threshold
  • read name: match against a regex, it's a read match against himself

fpa is available in bioconda and in cargo, you can found more information on github, and you can use it in your pipeline like gzip :

minimap2 -x ava-ont reads.fa reads.fa | fpa -d --compression-output gzip > only_dovetails_overlaps.paf.gz

OK filtering matches is easy and we have many available solution.

A Binary Alignment Format

This is the question they initiate this blog post.

jPAF is only a POC I create them to test things on how to compress this type of data. I introduce to you now because the results already seem very interesting. However, there is still a lot of work to do, before release a good binary pairwise alignment format.

jPAF is a json file that contains the same information as a PAF file, but it is reorganized to save space, so it isn't really a binary.

We have 3 main objects in this json:

  • header_index: a dict they associate an index to a header name
  • reads_index: associate a read name and its length to an index
  • matches: a list of matches, a match header_index

A little example is probably better:

PAF version :

1_2    5891    1642    5889    +    2_3    4490    1    4248    4247    4247    0    tp:A:S

jPAF version:

{
   "header_index":{
      "0":"read_a",
      "1":"begin_a",
      … # index associate to the paf header name
   },
   "read_index":[
      {
         "name":"1_2",
         "len":5891
      },
      {
         "name":"2_3",
         "len":4490
      },
   ],
   "match":[
      [
         0,     # index of read in read_index table
         1642,  # begin position for read A
         5889,  # end position of read A
         "+",   # read have same strand
         1,	# index of read in read_index table
         1,	# begin position of read B
         4248,	# end position of read B
         4247,	# the other fields of the paf format
         4247,
         0,
         "tp:A:S"
      ],
   ]
}

In this example we didn't save space but I demonstrate in result section how jPAF performance grow up with size of PAF. Here we have two reads 1_2 and 2_3, with 5891 and 4490 bases respectively (store in read_index object), and one overlap with length 4247 bases in the same strand between them (store in match object).

jPAF are fully inspired by PAF, and have same number of fields, and same field names. I just take the PAF, convert it in json and add two little tricks to save space.

  • First trick writes read names and read length one time.
  • Second trick is more of a json trick. At first, each record was a dictionary with a keyname associate to a value, but with this solution jPAF was heaviest than PAF. However, if I associate a field name to an index, I can store records just in classical table and avoid redundancy.

OK, I have a pretty cool format, to avoid some repetition without loss of information, but do I really save space?

Result

Dataset: For this, I reuse the same dataset as my previous blog post. It is composed ot two real E. coli datasets: pacbio and nanopore.

Mapping: I run minimap2 mapping with preset ava-pb and ava-ont for pacbio and nanopore respectively.

If you want to replicate these results, just follow the instructions avaible at github repro. Full data are avaible here: nanopore, pacbio.

basic.sam designates the minimap2 output in sam format, short.sam designates the minimap2 output with then SEQ, QUAL fields replaced by a '*'.

PAF against jPAF

Nanopore:

paf paf.gz paf.bz2 paf.xz
jpaf 64.65 %
jpaf.gz 94.73 % 64.13 % 66.11 % 22.01 %
jpaf.bz2 94.42 % 62.01 % 64.11 % 17.41 %
jpaf.xz 95.84 % 71.70 % 73.26 % 38.47 %

Pacbio:

paf paf.gz paf.bz2 paf.xz
jpaf 71.04 %
jpaf.gz 92.57 % 50.36 % 34.08 % 28.79 %
jpaf.bz2 93.78 % 58.43 % 44.79 % 40.36 %
jpaf.xz 93.73 % 58.14 % 44.41 % 39.95 %

If I compare PAF against jPAF compressed with lzma I win 95.84%. I have a justification for my title, it's 99% when I removed same read, containment, internal, less than 500 bp matches with fpa.

It's less impressive but more accurate and realistic. At the same compression level, I earn between 71.04% and 38.47%. We can notice a decrease of the efficacity of jPAF against PAF when the compression algorithm becomes better.

Impact of size of input PAF on jPAF compression ratio