How to reduce the impact of your PAF file on your disk by 95%

10 minute read 11 October 2018

Last week Shaun Jackman posted this tweet:

For people who have not worked on long-read assembly: for read correction or assembly graph construction, we need to map the reads against each other. Minimap2 is a very good mapper used to find similar regions between long reads. Its output is a PAF file (Pairwise Alignment Format), summarized in the minimap2 man page. Roughly, it is a TSV file which stores, for each similar region found (called a match below): the read names, the read lengths, the begin and end positions of the match, plus some other information.
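
To make this concrete, here is a minimal Python sketch that splits one PAF line into its twelve standard columns (the field names are my own labels; the optional SAM-like tags from column 13 onwards are kept as raw strings):

# Minimal sketch: parse one PAF line into its twelve standard columns.
PAF_COLUMNS = [
    "query_name", "query_len", "query_start", "query_end", "strand",
    "target_name", "target_len", "target_start", "target_end",
    "n_match", "block_len", "mapq",
]

def parse_paf_line(line: str) -> dict:
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(PAF_COLUMNS, fields[:12]))
    for key in ("query_len", "query_start", "query_end",
                "target_len", "target_start", "target_end",
                "n_match", "block_len", "mapq"):
        record[key] = int(record[key])
    record["tags"] = fields[12:]   # optional SAM-like tags, kept as-is
    return record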

This tweet prompted a discussion, and three solutions were proposed:

  • using the classical compression formats for mapping against a reference (BAM/CRAM)
  • filtering some matches
  • creating a new compressed binary format to store all-vs-all mappings

Use mapping-against-reference compression formats

Torsten Seemann and I suggested using a special flag to get the minimap2 output in SAM and compressing it into BAM/CRAM format. But apparently it doesn't work well.

OK, even after removing unnecessary fields from the SAM output (the sequence and quality fields) and using a better compression solution, it isn't better than the PAF format compressed with gzip. Maybe with a larger dataset a CRAM file could beat a gzipped PAF file, but I am not convinced by this solution.

Filter

Heng Li said:

Minimap doesn't make any assumption about what the user wants to do with the read matches, and this is a very good thing, but sometimes you store too much information for your usage. So filtering overlaps before storing them on disk could be a good idea.

An Awk, Bash, Python, {choose your language} script could do this job perfectly.
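
For example, here is a minimal Python sketch that drops self-matches and matches whose alignment block is shorter than 500 bp (the criteria and the threshold are just an illustration of the kind of filter you might want):

#!/usr/bin/env python3
# Keep only PAF matches with an alignment block >= 500 bp, and drop matches
# of a read against itself. Reads PAF on stdin, writes filtered PAF on stdout.
import sys

MIN_BLOCK_LEN = 500

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    query_name, target_name = fields[0], fields[5]
    block_len = int(fields[10])   # column 11: alignment block length
    if query_name != target_name and block_len >= MIN_BLOCK_LEN:
        sys.stdout.write(line)

You just plug it between minimap2 and gzip in your pipeline.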

Alex Di Genova suggested using the minimap2 API to build a special minimap2 with built-in filters. This solution probably has better performance than an ad hoc script, but it's less flexible and can't be applied to other mappers.

My solution

It's a little piece of software written in Rust, fpa (for Filter Pairwise Alignment). It takes pairwise alignments in PAF or MHAP format as input and can filter matches by:

  • type: containment, internal-match, dovetail
  • length: the match is longer or shorter than a threshold
  • read name: the name matches a regex, or the read is matched against itself

fpa is available on Bioconda and on cargo (crates.io); you can find more information on GitHub, and you can use it in your pipeline like gzip:

minimap2 -x ava-ont reads.fa reads.fa | fpa -d --compression-output gzip > only_dovetails_overlaps.paf.gz

OK, filtering matches is easy and we have many solutions available.

A Binary Alignment Format

This is the question that initiated this blog post.

jPAF is only a proof of concept I created to test ideas on how to compress this type of data. I present it now because the results already seem very interesting. However, there is still a lot of work to do before releasing a good binary pairwise alignment format.

jPAF is a JSON file that contains the same information as a PAF file, but reorganized to save space, so it isn't really binary.

We have three main objects in this JSON:

  • header_index: a dict that associates an index with each PAF header name
  • read_index: associates each read name and its length with an index
  • match: a list of matches; each match is an array whose positions follow header_index

A little example is probably better:

PAF version:

1_2    5891    1642    5889    +    2_3    4490    1    4248    4247    4247    0    tp:A:S

jPAF version:

{
   "header_index":{
      "0":"read_a",
      "1":"begin_a",
      … # each index is associated with a PAF header name
   },
   "read_index":[
      {
         "name":"1_2",
         "len":5891
      },
      {
         "name":"2_3",
         "len":4490
      }
   ],
   "match":[
      [
         0,        # index of read A in the read_index table
         1642,     # begin position on read A
         5889,     # end position on read A
         "+",      # the reads are on the same strand
         1,        # index of read B in the read_index table
         1,        # begin position on read B
         4248,     # end position on read B
         4247,     # the remaining fields of the PAF format
         4247,
         0,
         "tp:A:S"
      ]
   ]
}

In this example we didn't save any space, but I show in the results section how jPAF's performance grows with the size of the PAF. Here we have two reads, 1_2 and 2_3, of 5891 and 4490 bases respectively (stored in the read_index object), and one overlap of 4247 bases on the same strand between them (stored in the match object).

jPAF is fully inspired by PAF, with the same number of fields and the same field names. I just take the PAF, convert it to JSON and add two little tricks to save space.

  • The first trick writes each read name and read length only once.
  • The second trick is more of a JSON trick. At first, each record was a dictionary with a key name associated with each value, but with this solution jPAF was heavier than PAF. However, if I associate each field name with an index, I can store the records as plain arrays and avoid that redundancy (see the sketch after this list).
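
To make the two tricks concrete, here is a minimal Python sketch of the conversion (the field names in header_index are illustrative; the real jPAF POC may label them differently):

#!/usr/bin/env python3
# Sketch of the jPAF idea: read names and lengths are stored once in
# read_index, and each match becomes a plain array ordered like header_index.
import json
import sys

HEADER = ["read_a", "begin_a", "end_a", "strand", "read_b", "begin_b",
          "end_b", "match", "block_len", "mapq", "tags"]

def paf_to_jpaf(paf_lines):
    read_index = []   # [{"name": ..., "len": ...}, ...]
    read_id = {}      # read name -> position in read_index
    matches = []

    def index_of(name, length):
        if name not in read_id:
            read_id[name] = len(read_index)
            read_index.append({"name": name, "len": length})
        return read_id[name]

    for line in paf_lines:
        col = line.rstrip("\n").split("\t")
        a = index_of(col[0], int(col[1]))   # trick 1: store name/len only once
        b = index_of(col[5], int(col[6]))
        matches.append([                    # trick 2: plain array, no key names
            a, int(col[2]), int(col[3]), col[4],
            b, int(col[7]), int(col[8]),
            int(col[9]), int(col[10]), int(col[11]),
            "\t".join(col[12:]),            # optional SAM-like tags
        ])

    return {
        "header_index": {str(i): name for i, name in enumerate(HEADER)},
        "read_index": read_index,
        "match": matches,
    }

if __name__ == "__main__":
    json.dump(paf_to_jpaf(sys.stdin), sys.stdout, separators=(",", ":"))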

OK, I have a pretty cool format that avoids some repetition without losing information, but do I really save space?

Results

Dataset: I reuse the same data as in my previous blog post. It is composed of two real E. coli datasets: one PacBio and one Nanopore.

Mapping: I run minimap2 with the presets ava-pb and ava-ont for PacBio and Nanopore respectively.

If you want to replicate these results, just follow the instructions available in the GitHub repo. The full data are available here: nanopore, pacbio.

basic.sam designates the minimap2 output in SAM format; short.sam designates the minimap2 output with the SEQ and QUAL fields replaced by '*'.
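
short.sam can be produced with something as simple as this sketch (SEQ and QUAL are columns 10 and 11 of a SAM record; this shows the idea, not necessarily the exact command used):

# Replace the SEQ (column 10) and QUAL (column 11) fields of SAM records
# by '*', leaving header lines (starting with '@') untouched.
import sys

for line in sys.stdin:
    if line.startswith("@"):
        sys.stdout.write(line)
        continue
    fields = line.rstrip("\n").split("\t")
    fields[9] = "*"    # SEQ
    fields[10] = "*"   # QUAL
    sys.stdout.write("\t".join(fields) + "\n")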

PAF against jPAF

Nanopore (each cell is the space saved by the jPAF variant of the row compared with the PAF variant of the column):

            paf        paf.gz     paf.bz2    paf.xz
jpaf        64.65 %
jpaf.gz     94.73 %    64.13 %    66.11 %    22.01 %
jpaf.bz2    94.42 %    62.01 %    64.11 %    17.41 %
jpaf.xz     95.84 %    71.70 %    73.26 %    38.47 %

Pacbio:

            paf        paf.gz     paf.bz2    paf.xz
jpaf        71.04 %
jpaf.gz     92.57 %    50.36 %    34.08 %    28.79 %
jpaf.bz2    93.78 %    58.43 %    44.79 %    40.36 %
jpaf.xz     93.73 %    58.14 %    44.41 %    39.95 %

If I compare uncompressed PAF against jPAF compressed with lzma, I save 95.84 %. There is the justification for my title, and it even reaches 99 % when I remove self-matches, containments, internal matches and matches shorter than 500 bp with fpa.

It's less impressive but more accurate and realistic to compare at the same compression level: there I save between 38.47 % and 71.04 %. We can notice that the advantage of jPAF over PAF shrinks as the compression algorithm gets better.
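
For reference, the percentages above and the "saved space" rows in the next sections are simply the relative size reduction; a quick sketch (the file sizes in the tables are rounded, so the recomputed value only approximately matches):

def saved_space(new_size, old_size):
    """Percentage of disk space saved when new_size replaces old_size."""
    return 100.0 * (1.0 - new_size / old_size)

# Nanopore: jpaf.xz (90M) against the uncompressed paf (2.2G, about 2253M)
print(round(saved_space(90, 2253), 2))   # -> 96.01, close to the 95.84 % above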

Impact of size of input PAF on jPAF compression ratio

The horizontal axis (logarithmic) shows the number of PAF records, and the vertical axis shows the percentage of space saved by converting them into uncompressed jPAF. I built this curve on the Nanopore dataset. We can see that after two records jPAF is already smaller than PAF, but we only reach the ratios observed on the complete dataset after 2^19 (262,144) records.

Effect of compression on each format

Nanopore:

                        raw     gz      bz2     xz      bam     cram
paf        size         2.2G    317M    336M    146M
           saved space          85.31%  84.45%  93.24%
jpaf       size         762M    114M    121M    90M
           saved space          85.09%  84.21%  88.24%
basic.sam  size         25G                             5.8G    5.2G
           saved space                                  76.45%  78.57%
short.sam  size         24G                             5.5G    4.8G
           saved space                                  76.61%  79.55%

Pacbio:

                        raw     gz      bz2     xz      bam     cram
paf        size         2.4G    361M    272M    252M
           saved space          85.04%  88.73%  89.57%
jpaf       size         698M    179M    150M    151M
           saved space          74.36%  78.52%  78.38%
basic.sam  size         24G                             8.4G    4.5G
           saved space                                  64.44%  81.28%
short.sam  size         24G                             8.4G    4.4G
           saved space                                  64.41%  81.05%

BAM/CRAM compression isn't better than the classical compression formats (this confirms the preliminary result). The compression ratios are even quite similar across formats. These tables confirm what we observed in the previous section: compression ratios are better on PAF than on jPAF, but jPAF files remain smaller than PAF files.

Discussion

Minimap2 generates too many matches, but it's easy to remove the unneeded ones with fpa or with ad hoc tools. The formats designed to store mappings against a reference aren't better than the current format compressed with a general-purpose algorithm.

The main problem of jPAF is directly related to its main quality: the read_index saves disk space, but at the cost of time and RAM. You can't stream this format out; you need to wait until all alignments are found before writing. And if you want to read the file, you need to keep the read_index in RAM the whole time.

But I wrote it in two hours. The format is lossless and it saves 33 to 64 % of disk space. Obviously we could save even more space by moving from a text format to a binary format, especially for the numeric and strand fields (7 bits per match). But it's easier to test things with a text file.
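
As a rough illustration of that binary idea (not part of jPAF, just a sketch), here is what a fixed-size record could look like, with read names replaced by read_index positions and the strand stored as a single bit of a flags byte instead of a whole character:

import struct

def pack_match(read_a, begin_a, end_a, strand, read_b, begin_b, end_b,
               n_match, block_len, mapq):
    # bit 0 of the flags byte: reverse strand; the 7 remaining bits are free
    flags = 0x01 if strand == "-" else 0x00
    return struct.pack("<IIIIIIIIBB",
                       read_a, begin_a, end_a,
                       read_b, begin_b, end_b,
                       n_match, block_len, mapq, flags)

# The example record from the jPAF section packs into 34 bytes, against
# roughly 56 bytes of tab-separated PAF text (more with real read names).
record = pack_match(0, 1642, 5889, "+", 1, 1, 4248, 4247, 4247, 0)
assert len(record) == 34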

In conclusion, the solution to Shaun Jackman's trouble is probably to filter matches, but in a future with more and more long-read data, we will probably need a compressed binary format. These results encourage me to continue along this approach, and I need to read more about BAM/CRAM compression techniques and the canu ovlStore.

Acknowledgements

For proofreading:
