MergeSVEvidence by tedsharpe · Pull Request #7695 · broadinstitute/gatk

tedsharpe · 2022-02-23T20:59:34Z

No description provided.

mwalker174 · 2022-03-01T20:34:42Z

@droazen May want someone from engine team to look at this since it touches some core classes.

mwalker174

Thanks @tedsharpe this looks good so far. You've done the hard part of finagling the evidence feature classes into a nice framework for this. I've taken a first pass, excluding tests, and I just had two main comments:

1. The tool currently just merges records in dictionary order and subsets samples when necessary. I probably wasn't clear on everything the tool should do, but I have a comment below outlining the cases when Features actually need to be merged as well (e.g. RD records at the same interval with different samples). It would also be nice to have some kind of check for collisions - for example the same sample and interval defined in two of the input files is probably unexpected and should result in an error.
2. It seems like there's a lot of overlap with the MultiVariantWalker and MultiVariantDataSource classes. IMO, VariantContext implements Feature and therefore it seems redundant to have completely separate classes for Features as well. Is there a way to consolidate the code such that MultiVariantWalker extends FeatureMergingWalker and/or defining a MultiFeatureDataSource extended by MultiVariantDataSource? @droazen any thoughts?

mwalker174 · 2022-03-01T20:30:31Z

src/main/java/org/broadinstitute/hellbender/tools/sv/SVFeaturesHeader.java

 import java.util.List;

-public class FeaturesHeader {
+public class SVFeaturesHeader {


Not sure why you needed to move and rename this, although I can see the class isn't used anywhere currently. Maybe it will make more sense as I go along.

It just seemed to me that the name was too generic -- suggesting that it applied to all Features -- and, in particular, that it could easily be confused with the FeatureCodecHeader. I just wanted to make the name suggest to which Feature sub-types it applied.
But there's no particularly compelling reason to change it, and it can be reverted.

Ok I'm fine with it if it's only used for SV classes

mwalker174 · 2022-03-01T20:41:35Z

src/main/java/org/broadinstitute/hellbender/engine/FeatureMergingWalker.java

+    public Set<String> getSampleNames() { return samples; }
+
+    private void setDictionaryAndSamples() {
+        dictionary = getMasterSequenceDictionary();


Why not use getBestAvailableSequenceDictionary() instead? Rather than defining special logic here (although you would need to modify it to look at the features headers)

That method pukes if there are multiple dictionaries -- even if they're equivalent. And trying to scrape an incomplete dictionary from the index compounded the problem.
This method allows multiple dictionaries if they're all subsets (with respect to name and order) of the largest dictionary.

I see, I think a comment for this function explaining that logic would be helpful. IMO, it's a little confusing to have multiple ways of handling multi-dictionary inputs within the engine. MultiVariantWalker seems to just use the first vcf's dictionary which also seems arbitrary and contains some costly code to check the dictionaries for consistency.

Maybe we should provide a way for subclasses to choose what kind of dictionary check is performed? That way you have both options for both MultiVariantWalker and FeatureMergingWalker. Perhaps the engine folks should weigh in here though.
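
For readers following the dictionary discussion above, here is a minimal sketch of the "subset with respect to name and order" check. It is not the PR's code: it assumes a dictionary can be reduced to a list of contig names, and `DictionarySubsetSketch` and `isOrderedSubset` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: contig names only, not htsjdk's SAMSequenceDictionary.
final class DictionarySubsetSketch {

    /** True if every contig in small also appears in large, in the same relative order. */
    static boolean isOrderedSubset(final List<String> small, final List<String> large) {
        int lastIdx = -1;
        for (final String contig : small) {
            final int newIdx = large.indexOf(contig);
            if (newIdx == -1 || newIdx <= lastIdx) {
                return false; // unknown contig, or out of order relative to the larger dictionary
            }
            lastIdx = newIdx;
        }
        return true;
    }

    public static void main(final String[] args) {
        final List<String> large = Arrays.asList("chr1", "chr2", "chr3");
        System.out.println(isOrderedSubset(Arrays.asList("chr1", "chr3"), large));
        System.out.println(isOrderedSubset(Arrays.asList("chr3", "chr1"), large));
    }
}
```

Under this rule, equivalent dictionaries pass trivially, and a smaller dictionary passes as long as it never reorders contigs relative to the largest one.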

mwalker174 · 2022-03-01T21:13:49Z

src/main/java/org/broadinstitute/hellbender/engine/FeatureMergingWalker.java

+ * To use this walker you need only implement the abstract apply method in a class that declares
+ * a list of FeatureInputs as an argument.
+ */
+public abstract class FeatureMergingWalker<F extends Feature> extends WalkerBase {


How about MultiFeatureWalker? Akin to MultiVariantWalker?

Are you referring to the name? Seems like a fine suggestion to rename it MultiFeatureWalker, now that I understand MultiVariantWalker better. However, I'd rather not get bogged down in refactoring the MultiVariantWalker for this PR unless you or the engine team think it's important to do so.

Yes here I was just suggesting a name change. I'll also leave it up to the engine team to decide whether they should actually be sharing code.

tedsharpe · 2022-03-10T21:38:07Z

I think this is ready for another PR review.

gatk-bot · 2022-03-10T22:11:59Z

Travis reported job failures from build 38071
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
unit	openjdk11	38071.13	logs
unit	openjdk8	38071.3	logs

mwalker174

Thanks @tedsharpe I think this is almost ready! For the FeatureOutputCodec, I think we should err on the side of caution here and do some extra checks to make sure inputs are as expected, even if there is some performance hit. I also have some suggestions on how to refactor a bit to make some of its functionality more re-usable for future tools.

Other than that, there are just some places that would benefit from some documentation, including the tool itself.

mwalker174 · 2022-03-17T20:02:35Z

src/main/java/org/broadinstitute/hellbender/utils/codecs/FeatureOutputCodec.java

+    Comparator<F> getSameLocusComparator();
+    void resolveSameLocusFeatures( PriorityQueue<F> queue, S sink );


Since they're public, can you make these methods safer to use in your implementations? I'd worry about bugs where someone (i.e. me) hands it features from different loci. Throw an error if that happens.

I do think that this functionality could be useful in future tools as well - being able to consume lists of feature files and merge them on the fly could come in handy. Can you move these to SVFeature instead? To be more general-purpose, consume a Collection<SVFeature> input and have an SVFeature output? Seems like most of the implementations are in the feature classes anyway.

It would be nice to have SVFeature or the evidence classes themselves implement Comparable more generally, and enforce stable ordering in the output, even if there's a small performance hit.

Moved resolution of same-locus features into a general post-processing companion class (which would be available for other situations). One could hook these up as part of a streaming operation, I think, which would allow your suggested use case.
I really wish that features could implement Comparable. Alas, they require a dictionary to compare. This could be encapsulated in a Comparable, but then you'd have to pass the comparable around (instead of passing a dictionary around), and so you're no further ahead.
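
To illustrate why the ordering needs a dictionary, here is a small sketch of a comparator built from a contig order. It is an assumption-laden illustration, not the PR's implementation: `Locus`, `LocusComparatorSketch`, and `forContigOrder` are hypothetical names, and the dictionary is reduced to a list of contig names.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LocusComparatorSketch {

    // Hypothetical minimal feature: just a contig name and a start position.
    static final class Locus {
        final String contig;
        final int start;

        Locus(final String contig, final int start) {
            this.contig = contig;
            this.start = start;
        }
    }

    /** Builds a comparator that orders by the contig's index in the dictionary, then by start. */
    static Comparator<Locus> forContigOrder(final List<String> contigOrder) {
        final Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < contigOrder.size(); i++) {
            index.put(contigOrder.get(i), i);
        }
        return Comparator
                .comparingInt((Locus locus) -> contigIndex(index, locus.contig))
                .thenComparingInt(locus -> locus.start);
    }

    private static int contigIndex(final Map<String, Integer> index, final String contig) {
        final Integer idx = index.get(contig);
        if (idx == null) {
            throw new IllegalArgumentException("contig not in dictionary: " + contig);
        }
        return idx;
    }
}
```

A natural ordering on the feature alone could only compare contig names as strings, which would place "chr10" before "chr2"; the dictionary index is what makes the order match the reference.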

mwalker174 · 2022-03-17T20:07:33Z

src/main/java/org/broadinstitute/hellbender/tools/sv/PrintSVEvidence.java

- * </pre>
- *
- * @author Mark Walker &lt;markw@broadinstitute.org&gt;
- */


Going to need tool documentation here - you can base it off the old documentation. Make sure to enumerate the different constraints on each evidence type, and expected behavior for special cases such as filling in "missing values" for depth evidence.

mwalker174 · 2022-03-18T16:28:21Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepth.java

+        LocusDepth lastEvidence = queue.poll();
+        while ( !queue.isEmpty() ) {
+            final LocusDepth evidence = queue.poll();
+            if ( comparator.compare(lastEvidence, evidence) == 0 ) {


With this approach I would also worry about bugs where the order is not what we expect - i.e. queue is not built with the Feature's comparator. Can you make this throw an error if this is the case by checking queue.comparator() (and for the other features as well)?
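
A minimal sketch of the requested guard follows. It relies only on `PriorityQueue.comparator()` from the JDK; the class and method names are hypothetical, and it assumes that an identity comparison against a shared comparator instance is an acceptable test.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class QueueComparatorCheckSketch {

    /** Fails fast if the queue was not constructed with the expected comparator instance. */
    static void requireComparator(final PriorityQueue<?> queue, final Comparator<?> expected) {
        // comparator() returns null when the queue uses natural ordering.
        final Comparator<?> actual = queue.comparator();
        if (actual != expected) {
            throw new IllegalStateException("queue was not built with the expected comparator");
        }
    }
}
```

Because the check is by identity, it works best when each feature class exposes a single shared comparator constant that callers use to build the queue.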

mwalker174 · 2022-03-18T16:36:45Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepth.java

    private final String contig;
    private final int position;
+    private final String sample;
    private final byte refCall; // index into nucleotideValues


It just occurred to me that we probably don't need to know the reference base for SV calling. Can always be looked up anyhow.

mwalker174 · 2022-03-18T16:40:08Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+     */
+    public Set<String> getSampleNames() { return samples; }
+
+    private void setDictionaryAndSamples() {


Can you add some documentation with an overview of the logic here?

mwalker174 · 2022-03-18T16:40:48Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+        return largeDict;
+    }
+
+    public static final class MergingIterator<F extends Feature> implements Iterator<PQEntry<F>> {


Also a quick doc here
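
As context for what such a doc comment might describe, here is a sketch of a generic k-way merge over sorted iterators using a priority queue. It is not the PR's `MergingIterator` (which yields `PQEntry<F>` elements); `MergingIteratorSketch` and its inner `Entry` are hypothetical, and each source is assumed to be already sorted by the supplied comparator.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

final class MergingIteratorSketch<T> implements Iterator<T> {

    // Pairs the current head element of a source with the source it came from.
    private static final class Entry<E> {
        E head;
        final Iterator<E> source;

        Entry(final E head, final Iterator<E> source) {
            this.head = head;
            this.source = source;
        }
    }

    private final PriorityQueue<Entry<T>> queue;

    MergingIteratorSketch(final List<Iterator<T>> sources, final Comparator<? super T> comparator) {
        final Comparator<Entry<T>> byHead = (a, b) -> comparator.compare(a.head, b.head);
        queue = new PriorityQueue<>(Math.max(1, sources.size()), byHead);
        for (final Iterator<T> source : sources) {
            if (source.hasNext()) {
                queue.add(new Entry<>(source.next(), source));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !queue.isEmpty();
    }

    @Override
    public T next() {
        final Entry<T> entry = queue.poll();
        if (entry == null) {
            throw new NoSuchElementException();
        }
        final T result = entry.head;
        if (entry.source.hasNext()) {
            // Refill from the same source so the queue always holds one head per live source.
            entry.head = entry.source.next();
            queue.add(entry);
        }
        return result;
    }
}
```

The queue never holds more than one element per input, so memory use scales with the number of sources rather than the number of features.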

mwalker174 · 2022-03-18T16:43:36Z

src/main/java/org/broadinstitute/hellbender/tools/sv/SVFeaturesHeader.java

 import java.util.List;

-public class FeaturesHeader {
+public class SVFeaturesHeader {


Ok I'm fine with it if it's only used for SV classes

mwalker174 · 2022-03-18T16:47:27Z

src/main/java/org/broadinstitute/hellbender/utils/codecs/FeatureOutputCodecFinder.java

+import java.util.ArrayList;
+import java.util.List;
+
+public final class FeatureOutputCodecFinder {


I think a quick comment here explaining why this is needed would be helpful

mwalker174 · 2022-03-18T16:54:48Z

src/main/java/org/broadinstitute/hellbender/tools/sv/DepthEvidence.java

+                final int count = tmpCounts[idx];
+                if ( count != MISSING_DATA ) {
+                    if ( evCounts[idx] == MISSING_DATA ) {
+                        evCounts[idx] = tmpCounts[idx];


Kind of neat that we can "patch" missing regions if needed. Make sure to document this here and in the tool doc.
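
The patching behavior in the snippet above can be sketched in isolation as follows. This is an illustration under assumptions: the sentinel value of `MISSING_DATA` is chosen arbitrarily here, `DepthPatchSketch` and `patchCounts` are hypothetical names, and the conflict check is an assumed policy that the diff excerpt does not show.

```java
final class DepthPatchSketch {

    static final int MISSING_DATA = -1; // assumed sentinel; the real constant may differ

    /** Fills missing entries of evCounts from tmpCounts, leaving present values in place. */
    static void patchCounts(final int[] evCounts, final int[] tmpCounts) {
        if (evCounts.length != tmpCounts.length) {
            throw new IllegalArgumentException("count arrays must cover the same samples");
        }
        for (int idx = 0; idx < tmpCounts.length; idx++) {
            final int count = tmpCounts[idx];
            if (count != MISSING_DATA) {
                if (evCounts[idx] == MISSING_DATA) {
                    evCounts[idx] = count; // patch the gap
                } else if (evCounts[idx] != count) {
                    // Assumed policy: two different values for one sample is a collision.
                    throw new IllegalStateException("conflicting counts at sample index " + idx);
                }
            }
        }
    }
}
```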

mwalker174 · 2022-03-18T16:58:34Z

src/main/java/org/broadinstitute/hellbender/tools/sv/DiscordantPairEvidence.java

+    public final static Comparator<DiscordantPairEvidence> comparator =
+            Comparator.comparing(DiscordantPairEvidence::getSample);


This could always return 0? Unless there's an important reason to sort the output by sample name?

Also I think if you still need these comparators after implementing Comparable, they should be private and renamed to something like sameLocusComparator

mwalker174 · 2022-03-24T15:51:50Z

Thank you @tedsharpe for addressing my comments, I just have two additional requests:

1. The tool still needs a Doc comment header (e.g., see the SelectVariants example) to generate doc for the GATK Tool Index. It can be minimal with just "Inputs", "Outputs", and "Usage examples" sections.
2. It would be nice to cut down on repeated code in the EvidenceSortMerger classes - it seems like the methods are all nearly the same, with the exception of the Comparator definition and the inner loop code in resolveSameLocusFeatures(). Could you move the common code up to an abstract SVEvidenceSortMerger class?
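
One possible shape for that refactoring is a template-method base class, sketched below. It is hypothetical throughout: `SameLocusMergerSketch`, `add`, `flush`, and `resolve` are illustrative names rather than the PR's API, and the sketch assumes the caller flushes once per locus.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.function.Consumer;

abstract class SameLocusMergerSketch<F> {

    private final Comparator<F> sameLocusComparator;
    private final PriorityQueue<F> queue;
    private final Consumer<F> sink;

    protected SameLocusMergerSketch(final Comparator<F> sameLocusComparator, final Consumer<F> sink) {
        this.sameLocusComparator = sameLocusComparator;
        this.queue = new PriorityQueue<>(11, sameLocusComparator);
        this.sink = sink;
    }

    /** Shared: buffer a feature that belongs to the current locus. */
    final void add(final F feature) {
        queue.add(feature);
    }

    /** Shared: drain the buffer in comparator order, resolving features that compare equal. */
    final void flush() {
        if (queue.isEmpty()) {
            return;
        }
        F last = queue.poll();
        while (!queue.isEmpty()) {
            final F next = queue.poll();
            if (sameLocusComparator.compare(last, next) == 0) {
                last = resolve(last, next);
            } else {
                sink.accept(last);
                last = next;
            }
        }
        sink.accept(last);
    }

    /** Subclass-specific: decide what to do with two features that compare equal. */
    protected abstract F resolve(F first, F second);
}
```

Each evidence type would then supply only its comparator and its `resolve` logic, which are the two pieces the comment identifies as differing.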

mwalker174

Thank you! Looks good to me. @droazen did you want to review as well?

droazen

@tedsharpe I've posted a review on the changes to engine-level classes -- a couple of requests but nothing major.

droazen · 2022-04-01T17:35:28Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+     * Operations performed just prior to the start of traversal.
+     */
+    @Override
+    public void onTraversalStart() {


onTraversalStart() should generally be reserved for tool authors to override to perform whatever initialization their tool needs. Initialization for Walker classes themselves should be done by overriding onStartup(), making it final, and calling super.onStartup(); as the first line, as seen in, e.g., FeatureWalker.

One of these days we really need to rename onStartup() to something like initializeTraversal() or initializeEngine() to make this distinction clearer.
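
The pattern described above can be sketched with stand-in classes. None of these are GATK types: `ToolBase`, `WalkerSketch`, and `MyTool` are hypothetical, and the `events` list exists only to make the call order visible.

```java
import java.util.ArrayList;
import java.util.List;

final class StartupPatternSketch {

    // Hypothetical stand-in for the engine's tool base class.
    static abstract class ToolBase {
        final List<String> events = new ArrayList<>();

        protected void onStartup() {
            events.add("engine startup");
        }

        // Reserved for tool authors; does nothing by default.
        public void onTraversalStart() { }

        final void run() {
            onStartup();
            onTraversalStart();
        }
    }

    // The walker does its own initialization in a final onStartup().
    static abstract class WalkerSketch extends ToolBase {
        @Override
        protected final void onStartup() {
            super.onStartup(); // first line, so engine state is ready
            events.add("walker startup");
        }
    }

    // A tool author overrides onTraversalStart() freely without disturbing walker setup.
    static final class MyTool extends WalkerSketch {
        @Override
        public void onTraversalStart() {
            events.add("tool onTraversalStart");
        }
    }
}
```

Making `onStartup()` final in the walker means a tool cannot accidentally skip the walker's initialization by forgetting a `super` call.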

droazen · 2022-04-01T17:38:06Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+ * To use this walker you need only implement the abstract apply method in a class that declares
+ * a collection of FeatureInputs as an argument.
+ */
+public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase {


Can you add an ExampleMultiFeatureWalker in org.broadinstitute.hellbender.tools.examples + an ExampleMultiFeatureWalkerIntegrationTest, modeled after the existing ExampleFeatureWalker and ExampleFeatureWalkerIntegrationTest? We generally try to do this for each new traversal added to GATK both as examples for tool authors and to catch regressions in the Walker classes.

droazen · 2022-04-01T17:40:45Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+ * multiple sources of Features.  The input files for each feature must be sorted by locus.
+ *
+ * To use this walker you need only implement the abstract apply method in a class that declares
+ * a collection of FeatureInputs as an argument.


If you take my suggestion below to override onStartup() instead of onTraversalStart(), add to the docs here:

and may optionally implement {@link #onTraversalStart()}, {@link #onTraversalSuccess()}, and/or {@link #closeTool()}.

which is the usual pattern for Walker classes

droazen · 2022-04-01T17:46:37Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+     * @param feature Current Feature being processed.
+     * @param header Header object for the source from which the feature was drawn (may be null)
+     */
+    public abstract void apply( final F feature, final Object header );


Is there any potential use case for having overlapping reference bases and/or reads as a side input here? If that's outside the scope of this traversal then that's fine, but if it could be potentially useful for future tools consider adding a ReferenceContext and/or ReadsContext to apply(), as seen in FeatureWalker and most of the other Walker classes.

Add ReadsContext and ReferenceContext to apply method? OK, I've done so.
[Editorializing of the worst sort: I've done so because consistency seems more important right now than design, but it bugs me. It seems crazy to pass dummy objects that might not even be useful (when unbacked) when we don't know whether we even need them. Seems to me that it would've been much smarter to provide a walker callback method that the apply methods could use if they need this info. Please put it on your list of design annoyances that we'll never have time to fix. Or maybe there's a good justification that has simply escaped me.]

@tedsharpe GATK has always been structured around universal arguments like -R and -I that all tools accept, and that (when provided) automatically populate contextual objects passed in to the tools. The main justification for this design is to not require any extra work for the tool authors when overlapping data from other sources is required, and to have the most common kinds of inputs "wired up" in advance and available for the tool to use as needed. Tools can call the hasReference(), etc., methods from GATKTool, or the hasBackingDataSource() methods on the context objects themselves if they need to query whether a particular kind of contextual data is available.

Yeah. I get that. IMHO, it's not really extra work to get the objects you need when you actually need them, and that's better than preparing dummy objects that you might not even want or use that get passed to you every time. I know that's how it's always worked, but it seems silly to me. I revised the code to do things the standard way.

It's a little more discoverable / self-documenting this way -- a reminder that the engine can provide you with all these side inputs if you need them. Plus there is no overhead to having these contextual objects around: they are all implemented in terms of lazy queries that don't happen until you access them.

Nope. Comes from the Ministry of Silly Walkers. That's my story, and I'm stickin' to it.

Hah, well, it may not convince you, but here's one last defense: the context objects are all initialized with the interval of the primary record -- since they are "tied" to the primary record in that sense, it makes sense that they should be passed in during the traversal along with the primary record.

droazen · 2022-04-01T17:52:11Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+        }
+    }
+
+    private SAMSequenceDictionary bestDictionary( final SAMSequenceDictionary newDict,


Add a method comment documenting the approach taken for selecting the "best" dictionary

Approach for "best" dictionary was documented in setDictionaryAndSamples, but I've rejiggered it to make it less easily missed. (The bestDictionary method is now a static betterDictionary method, which is what was actually going on.)

droazen · 2022-04-01T18:18:53Z

src/main/java/org/broadinstitute/hellbender/engine/FeatureDataSource.java

        final Object header = getHeader();
-        if ( header instanceof FeaturesHeader ) {
-            dict = ((FeaturesHeader)header).getDictionary();
+        if ( header instanceof SVFeaturesHeader ) {


What's the rationale behind the FeaturesHeader -> SVFeaturesHeader migration?

Why FeaturesHeader --> SVFeaturesHeader? It just seemed to me that the name was too generic -- suggesting that it applied to all Features -- and, in particular, that it could easily be confused with the FeatureCodecHeader. I just wanted to make the name suggest to which Feature sub-types it applied.

droazen · 2022-04-01T18:22:28Z

src/test/java/org/broadinstitute/hellbender/engine/MultiFeatureWalkerUnitTest.java

+import java.util.ArrayList;
+import java.util.List;
+
+public class MultiFeatureWalkerUnitTest extends CommandLineProgramTest {


Ah, I think this can become the ExampleMultiFeatureWalkerIntegrationTest I requested above if you promote your DummyMultiFeatureWalker into an official ExampleMultiFeatureWalker in org.broadinstitute.hellbender.tools.examples

droazen · 2022-04-01T18:23:44Z

src/test/java/org/broadinstitute/hellbender/engine/MultiFeatureWalkerUnitTest.java

+                "-" + StandardArgumentDefinitions.INPUT_SHORT_NAME, largeFileTestDir + "NA12878.alignedHg38.duplicateMarked.baseRealigned.bam",
+                // no dictionary, no sample names, with a single feature
+                "-" + StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt"
+        };


Consider using ArgumentsBuilder to build the command lines for your test cases.

droazen · 2022-04-01T18:25:33Z

src/test/java/org/broadinstitute/hellbender/engine/MultiFeatureWalkerUnitTest.java

+                // no dictionary, no sample names, with a single feature
+                "-" + StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt"
+        };
+        dummy.instanceMain(args);


Use runCommandLine(args) after this becomes ExampleMultiFeatureWalkerIntegrationTest

droazen · 2022-04-01T18:26:24Z

src/test/java/org/broadinstitute/hellbender/engine/MultiFeatureWalkerUnitTest.java

+            lastStart = feature.getStart();
+        }
+    }
+}


Need at least one test case that uses -L <intervals>

tedsharpe · 2022-04-04T21:24:27Z

Remainder of review comments addressed directly with no guff.

gatk-bot · 2022-04-04T21:38:32Z

Travis reported job failures from build 38460
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
cloud	openjdk8	38460.1	logs
cloud	openjdk11	38460.14	logs
unit	openjdk11	38460.13	logs

tedsharpe · 2022-04-05T16:46:16Z

OK. Passes checks. @droazen One final look? I couldn't figure out how to use runCommandLine, because I needed access to my walker instance. Is this OK, or is there a better way to do it?

tedsharpe · 2022-04-06T16:57:06Z

@droazen Can I get a thumbs up on this if I've addressed your suggestions adequately, or further input if not? Thanks.

droazen · 2022-04-06T16:57:57Z

@tedsharpe I'll take a look now!

droazen

@tedsharpe Engine changes look good now -- a few last easy comments, then go ahead and merge once they're addressed and tests pass (assuming of course that the changes in the sv package, which I didn't look at, have been fully reviewed)

droazen · 2022-04-06T17:23:37Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+            }
+            if ( newIdx <= lastIdx ) {
+                throw new UserException("Contig " + rec.getContig() +
+                                        " not in same order as in larger dictionary");


Since these are exceptions that actual users are likely to trigger, it would be helpful to include the contents of both dictionaries in the error messages. You can do this by throwing a UserException.IncompatibleSequenceDictionaries and passing the two dictionaries in.

droazen · 2022-04-06T17:26:07Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase {
+
+    SAMSequenceDictionary dictionary;
+    final Set<String> samples = new TreeSet<>();


Since there are accessor methods for these, can they be private? And relatedly, should the Set of samples be wrapped in a Collections.unmodifiableSet() after initialization to prevent downstream modification?
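
A minimal sketch of that suggestion, using only JDK collections. The class `SampleHolderSketch` and its `initialize` method are hypothetical and stand in for whatever initialization the walker performs.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

final class SampleHolderSketch {

    // Private: callers go through the accessor.
    private Set<String> samples = Collections.emptySet();

    /** Collects sample names once, then freezes the set behind an unmodifiable view. */
    void initialize(final Collection<String> sampleNames) {
        samples = Collections.unmodifiableSet(new TreeSet<>(sampleNames));
    }

    /** Returns a sorted, read-only view of the sample names. */
    Set<String> getSampleNames() {
        return samples;
    }
}
```

Any attempt to add or remove a name through the returned set throws `UnsupportedOperationException`, so downstream code cannot alter the walker's state.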

droazen · 2022-04-06T17:26:53Z

src/main/java/org/broadinstitute/hellbender/engine/MultiFeatureWalker.java

+     * internal state as possible.
+     *
+     * @param feature Current Feature being processed.
+     * @param header Header object for the source from which the feature was drawn (may be null)


Add javadoc for the all-important ReadsContext and ReferenceContext args

droazen · 2022-04-06T17:36:40Z

src/main/java/org/broadinstitute/hellbender/tools/examples/ExampleMultiFeatureWalker.java

+                                 final ReadsContext readsContext,
+                                 final ReferenceContext referenceContext ) {
+        // We'll just keep track of the Features we see, in the order that we see them.
+        features.add(feature);


This is a little inconsistent with the rest of the example walkers, which all produce some diagnostic output showing the records that are processed by the traversal. Could you have this tool produce some textual output along the lines of ExampleFeatureWalker in addition to accumulating the Features in memory?

mwalker174 self-assigned this Feb 25, 2022

mwalker174 requested changes Mar 1, 2022

mwalker174 mentioned this pull request Mar 2, 2022

BAF from BAM broadinstitute/gatk-sv#18

Closed

droazen self-requested a review March 2, 2022 19:53

droazen self-assigned this Mar 2, 2022

mwalker174 requested changes Mar 18, 2022

mwalker174 approved these changes Mar 25, 2022

droazen suggested changes Apr 1, 2022

droazen approved these changes Apr 6, 2022

MultiFeatureWalker + new PrintSVEvidence

0bf2c44

tedsharpe force-pushed the tws_MergeSVEvidence branch from 5ae634f to 0bf2c44 Compare April 7, 2022 17:40

tedsharpe merged commit dc35e76 into master Apr 7, 2022

tedsharpe deleted the tws_MergeSVEvidence branch April 7, 2022 20:13
