Skip to content

PrintReads introduces N bases when encoding some CRAMs and changes sequence #8768

@ilyasoifer

Description

@ilyasoifer

Bug Report

Affected tool(s) or class(es)

gatk PrintReads, picard MarkDuplicates

Affected version(s)

  • 4.4.0.0 (htsjdk 3.0.5, picard 3.0.0)

Description

I apologize in advance if this issue belongs to htsjdk. When we work with some of the CRAMs and pass them through PrintReads or picard MarkDuplicates, "N" bases get introduced.

We think that the problem happens when PrintReads write the CRAM rather than reading it, because if the output of PrintReads is a BAM, it does not happen.

We also noticed that this issue does not happen with earlier GATK (4.2.6.1), HTSJDK 2.24.1.

Happy to share the input files

Steps to reproduce

gatk PrintReads --input ultMerge.mt.cram --output ultMerge.mt.printreads.cram -R /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta
gatk PrintReads --input ultMerge.mt.cram --output ultMerge.mt.printreads.bam -R /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta
samtools view ultMerge.mt.gatk_printreads.cram | grep "038958_1-Z0011-5346565226"
038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	AS:i:54	X1:i:0	XD:Z:CATGAGG_GGTGATC	a3:i:92	bi:Z:5346565226	f1:Z:CATGAGG	f2:Z:GGTGATC	i1:Z:Q44	i2:Z:Q27	pr:i:22	pt:i:15	px:i:3813	py:i:1262	rq:f:0.03	si:i:3750	tm:Z:AQtq:i:195	MD:Z:0T0C0T0T0C0A0C0C0A0A0A0G0A0G0C0C0C0C0T0A0A0A0A0C0C0C0G0C0C0A0C0A0T0C0T0A0C0C0A0T0C0A0C0C0C0T0C0T0A0C0A0T0C0A0	NM:i:54	RG:Z:Z0011


samtools view ultMerge.mt.gatk_printreads.bam | grep "038958_1-Z0011-5346565226"

038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	X1:i:0	f1:Z:CATGAGG	i1:Z:Q44	f2:Z:GGTGATC	i2:Z:Q27	a3:i:92	XD:Z:CATGAGG_GGTGATC	RG:Z:Z0011	AS:i:54	bi:Z:5346565226	si:i:3750	tm:Z:AQ	rq:f:0.03	tq:i:195	pr:i:22pt:i:15	px:i:3813	py:i:1262

Expected behavior

BAM and CRAM outputs should behave the same

Actual behavior

BAM and CRAM outputs behave differently

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions