Conversation
|
@droazen May want someone from engine team to look at this since it touches some core classes. |
There was a problem hiding this comment.
Thanks @tedsharpe this looks good so far. You've done the hard part of finagling the evidence feature classes into a nice framework for this. I've taken a first pass, excluding tests, and I just had two main comments:
- The tool currently just merges records in dictionary order and subsets samples when necessary. I probably wasn't clear on everything the tool should do, but I have a comment below outlining the cases when Features actually need to be merged as well (e.g. RD records at the same interval with different samples). It would also be nice to have some kind of check for collisions - for example the same sample and interval defined in two of the input files is probably unexpected and should result in an error.
- It seems like there's a lot of overlap with the
MultiVariantWalkerandMultiVariantDataSourceclasses. IMO,VariantContextimplementsFeatureand therefore it seems redundant to have completely separate classes forFeatures as well. Is there a way to consolidate the code such thatMultiVariantWalkerextendsFeatureMergingWalkerand/or defining aMultiFeatureDataSourceextended byMultiVariantDataSource? @droazen any thoughts?
| import java.util.List; | ||
|
|
||
| public class FeaturesHeader { | ||
| public class SVFeaturesHeader { |
There was a problem hiding this comment.
Not sure why you needed to move and rename this, although I can see the class isn't used anywhere currently. Maybe it will make more sense as I go along.
There was a problem hiding this comment.
It just seemed to me that the name was too generic -- suggesting that it applied to all Features -- and, in particular, that it could easily be confused with the FeatureCodecHeader. I just wanted to make the name suggest to which Feature sub-types it applied.
But there's no particularly compelling reason to change it, and it can be reverted.
There was a problem hiding this comment.
Ok I'm fine with it if it's only used for SV classes
| public Set<String> getSampleNames() { return samples; } | ||
|
|
||
| private void setDictionaryAndSamples() { | ||
| dictionary = getMasterSequenceDictionary(); |
There was a problem hiding this comment.
Why not use getBestAvailableSequenceDictionary() instead? Rather than defining special logic here (although you would need to modify it to look at the features headers)
There was a problem hiding this comment.
That method pukes if there are multiple dictionaries -- even if they're equivalent. And trying to scrape an incomplete dictionary from the index compounded the problem.
This method allows multiple dictionaries if they're all subsets (with respect to name and order) of the largest dictionary.
There was a problem hiding this comment.
I see, I think a comment for this function explaining that logic would be helpful. IMO, it's a little confusing to have multiple ways of handling multi-dictionary inputs within the engine. MultiVariantWalker seems to just use the first vcf's dictionary which also seems arbitrary and contains some costly code to check the dictionaries for consistency.
Maybe we should provide a way for subclasses to choose what kind of dictionary check is performed? That way you have both options for both MultiVariantWalker and FeatureMergingWalker. Perhaps the engine folks should weigh in here though.
src/main/java/org/broadinstitute/hellbender/engine/FeatureMergingWalker.java
Outdated
Show resolved
Hide resolved
| * To use this walker you need only implement the abstract apply method in a class that declares | ||
| * a list of FeatureInputs as an argument. | ||
| */ | ||
| public abstract class FeatureMergingWalker<F extends Feature> extends WalkerBase { |
There was a problem hiding this comment.
How about MultiFeatureWalker? Akin to MultiVariantWalker?
There was a problem hiding this comment.
Are you referring to the name? Seems like a fine suggestion to rename it MultiFeatureWalker, now that I understand MultiVariantWalker better. However, I'd rather not get bogged down in refactoring the MultiVariantWalker for this PR unless you or the engine team think it's important to do so.
There was a problem hiding this comment.
Yes here I was just suggesting a name change. I'll also leave it up to the engine team to decide whether they should actually be sharing code.
src/main/java/org/broadinstitute/hellbender/tools/sv/MergeSVEvidence.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/sv/MergeSVEvidence.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/sv/DepthEvidence.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepth.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/sv/PrintSVEvidence.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/engine/FeatureMergingWalker.java
Outdated
Show resolved
Hide resolved
|
I think this is ready for another PR review. |
mwalker174
left a comment
There was a problem hiding this comment.
Thanks @tedsharpe I think this is almost ready! For the FeatureOutputCodec, I think we should err on the side of caution here and do some extra checks to make sure inputs are as expected, even if there is some performance hit. I also have some suggestions on how to refactor a bit to make some of its functionality more re-usable for future tools.
Other than that, there are just some places that would benefit from some documentation, including the tool itself.
| Comparator<F> getSameLocusComparator(); | ||
| void resolveSameLocusFeatures( PriorityQueue<F> queue, S sink ); |
There was a problem hiding this comment.
Since they're public, can you make these methods safer to use in your implementations? I'd worry about bugs where someone (i.e. me) hands it features from different loci. Throw an error if that happens.
I do think that this functionality could be useful in future tools as well - being able to consume lists of feature files and merge them on the fly could come in handy. Can you move these to SVFeature instead? To be more general-purpose, consume a Collection<SVFeature> input and have an SVFeature output? Seems like most of the implementations are in the feature classes anyway.
It would be nice to have SVFeature or the evidence classes themselves implement comparable more generally, and enforce stable ordering in the output, even if there's a small performance hit.
There was a problem hiding this comment.
Moved resolution of same-locus features into a general post-processing companion class (which would be available for other situations). One could hook these up as part of a streaming operation, I think, which would allow your suggested use case.
I really wish that features could implement Comparable. Alas, they require a dictionary to compare. This could be encapsulated in a Comparable, but then you'd have to pass the comparable around (instead of passing a dictionary around), and so you're no further ahead.
| * </pre> | ||
| * | ||
| * @author Mark Walker <markw@broadinstitute.org> | ||
| */ |
There was a problem hiding this comment.
Going to need tool documentation here - you can base if off the old documentation. Make sure to enumerate the different constraints on each evidence type, and expected behavior for special cases such as filling in "missing values" for depth evidence.
| LocusDepth lastEvidence = queue.poll(); | ||
| while ( !queue.isEmpty() ) { | ||
| final LocusDepth evidence = queue.poll(); | ||
| if ( comparator.compare(lastEvidence, evidence) == 0 ) { |
There was a problem hiding this comment.
With this approach I would also worry about bugs where the order is not what we expect - i.e. queue is not built with the Feature's comparator. Can you make this throw an error if this is the case by checking queue.comparator() (and for the other features as well)?
| private final String contig; | ||
| private final int position; | ||
| private final String sample; | ||
| private final byte refCall; // index into nucleotideValues |
There was a problem hiding this comment.
It just occurred to me that we probably don't need to know the reference base for SV calling. Can always be looked up anyhow.
| */ | ||
| public Set<String> getSampleNames() { return samples; } | ||
|
|
||
| private void setDictionaryAndSamples() { |
There was a problem hiding this comment.
Can you add some documentation with an overview of the logic here?
| return largeDict; | ||
| } | ||
|
|
||
| public static final class MergingIterator<F extends Feature> implements Iterator<PQEntry<F>> { |
| import java.util.List; | ||
|
|
||
| public class FeaturesHeader { | ||
| public class SVFeaturesHeader { |
There was a problem hiding this comment.
Ok I'm fine with it if it's only used for SV classes
| import java.util.ArrayList; | ||
| import java.util.List; | ||
|
|
||
| public final class FeatureOutputCodecFinder { |
There was a problem hiding this comment.
I think a quick comment here explaining why this is needed would be helpful
| final int count = tmpCounts[idx]; | ||
| if ( count != MISSING_DATA ) { | ||
| if ( evCounts[idx] == MISSING_DATA ) { | ||
| evCounts[idx] = tmpCounts[idx]; |
There was a problem hiding this comment.
Kind of neat that we can "patch" missing regions if needed. Make sure to document this here and in the tool doc.
| public final static Comparator<DiscordantPairEvidence> comparator = | ||
| Comparator.comparing(DiscordantPairEvidence::getSample); |
There was a problem hiding this comment.
This could always return 0? Unless there's an important reason to sort the output by sample name?
There was a problem hiding this comment.
Also I think if you still need these comparators after implementing Comparable, they should be private and renamed to something like sameLocusComparator
|
Thank you @tedsharpe for addressing my comments, I just have two additional requests:
|
mwalker174
left a comment
There was a problem hiding this comment.
Thank you! Looks good to me. @droazen did you want to review as well?
droazen
left a comment
There was a problem hiding this comment.
@tedsharpe I've posted a review on the changes to engine-level classes -- a couple of requests but nothing major.
| * Operations performed just prior to the start of traversal. | ||
| */ | ||
| @Override | ||
| public void onTraversalStart() { |
There was a problem hiding this comment.
onTraversalStart() should generally be reserved for tool authors to override to perform whatever initialization their tool needs. Initialization for Walker classes themselves should be done by overriding onStartup(), making it final, and calling super.onStartup(); as the first line, as seen in, eg., FeatureWalker.
One of these days we really need to rename onStartup() to something like initializeTraversal() or initializeEngine() to make this distinction clearer.
| * To use this walker you need only implement the abstract apply method in a class that declares | ||
| * a collection of FeatureInputs as an argument. | ||
| */ | ||
| public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase { |
There was a problem hiding this comment.
Can you add an ExampleMultiFeatureWalker in org.broadinstitute.hellbender.tools.examples + an ExampleMultiFeatureWalkerIntegrationTest, modeled after the existing ExampleFeatureWalker and ExampleFeatureWalkerIntegrationTest? We generally try to do this for each new traversal added to GATK both as examples for tool authors and to catch regressions in the Walker classes.
| * multiple sources of Features. The input files for each feature must be sorted by locus. | ||
| * | ||
| * To use this walker you need only implement the abstract apply method in a class that declares | ||
| * a collection of FeatureInputs as an argument. |
There was a problem hiding this comment.
If you take my suggestion below to override onStartup() instead of onTraversalStart(), add to the docs here:
and may optionally implement {@link #onTraversalStart()}, {@link #onTraversalSuccess()}, and/or {@link #closeTool()}.
which is the usual pattern for Walker classes
| * @param feature Current Feature being processed. | ||
| * @param header Header object for the source from which the feature was drawn (may be null) | ||
| */ | ||
| public abstract void apply( final F feature, final Object header ); |
There was a problem hiding this comment.
Is there any potential use case for having overlapping reference bases and/or reads as a side input here? If that's outside the scope of this traversal then that's fine, but if it could be potentially useful for future tools consider adding a ReferenceContext and/or ReadsContext to apply(), as seen in FeatureWalker and most of the other Walker classes.
There was a problem hiding this comment.
Add ReadsContext and ReferenceContext to apply method? OK, I've done so.
[Editorializing of the worst sort: I've done so because consistency seems more important right now than design, but it bugs me. It seems crazy to pass dummy objects that might not even be useful (when unbacked) when we don't know whether we even need them. Seems to me that it would've been much smarter to provide a walker callback method that the apply methods could use if they need this info. Please put it on your list of design annoyances that we'll never have time to fix. Or maybe there's a good justification that has simply escaped me.]
There was a problem hiding this comment.
@tedsharpe GATK has always been structured around universal arguments like -R and -I that all tools accept, and that (when provided) automatically populate contextual objects passed in to the tools. The main justification for this design is to not require any extra work for the tool authors when overlapping data from other sources is required, and to have the most common kinds of inputs "wired up" in advance and available for the tool to use as needed. Tools can call the hasReference(), etc., methods from GATKTool, or the hasBackingDataSource() methods on the context objects themselves if they need to query whether a particular kind of contextual data is available.
There was a problem hiding this comment.
Yeah. I get that. IMHO, it's not really extra work to get the objects you need when you actually need them, and that's better than preparing dummy objects that you might not even want or use that get passed to you every time. I know that's how it's always worked, but it seems silly to me. I revised the code to do things the standard way.
There was a problem hiding this comment.
It's a little more discoverable / self-documenting this way -- a reminder that the engine can provide you with all these side inputs if you need them. Plus there is no overhead to having these contextual objects around: they are all implemented in terms of lazy queries that don't happen until you access them.
There was a problem hiding this comment.
Nope. Comes from the Ministry of Silly Walkers. That's my story, and I'm stickin' to it.
There was a problem hiding this comment.
Hah, well, it may not convince you, but here's one last defense: the context objects are all initialized with the interval of the primary record -- since they are "tied" to the primary record in that sense, it makes sense that they should be passed in during the traversal along with the primary record.
| } | ||
| } | ||
|
|
||
| private SAMSequenceDictionary bestDictionary( final SAMSequenceDictionary newDict, |
There was a problem hiding this comment.
Add a method comment documenting the approach taken for selecting the "best" dictionary
There was a problem hiding this comment.
Approach for "best" dictionary was documented in setDictionaryAndSamples, but I've rejiggered it to make it less easily missed. (The bestDictionary method is now a static betterDictionary method, which is what was actually going on.)
| final Object header = getHeader(); | ||
| if ( header instanceof FeaturesHeader ) { | ||
| dict = ((FeaturesHeader)header).getDictionary(); | ||
| if ( header instanceof SVFeaturesHeader ) { |
There was a problem hiding this comment.
What's the rationale behind the FeaturesHeader -> SVFeaturesHeader migration?
There was a problem hiding this comment.
Why FeaturesHeader --> SVFeaturesHeader? It just seemed to me that the name was too generic -- suggesting that it applied to all Features -- and, in particular, that it could easily be confused with the FeatureCodecHeader. I just wanted to make the name suggest to which Feature sub-types it applied.
| import java.util.ArrayList; | ||
| import java.util.List; | ||
|
|
||
| public class MultiFeatureWalkerUnitTest extends CommandLineProgramTest { |
There was a problem hiding this comment.
Ah, I think this can become the ExampleMultiFeatureWalkerIntegrationTest I requested above if you promote your DummyMultiFeatureWalker into an official ExampleMultiFeatureWalker in org.broadinstitute.hellbender.tools.examples
| "-" + StandardArgumentDefinitions.INPUT_SHORT_NAME, largeFileTestDir + "NA12878.alignedHg38.duplicateMarked.baseRealigned.bam", | ||
| // no dictionary, no sample names, with a single feature | ||
| "-" + StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt" | ||
| }; |
There was a problem hiding this comment.
Consider using ArgumentsBuilder to build the command lines for your test cases.
| // no dictionary, no sample names, with a single feature | ||
| "-" + StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt" | ||
| }; | ||
| dummy.instanceMain(args); |
There was a problem hiding this comment.
Use runCommandLine(args) after this becomes ExampleMultiFeatureWalkerIntegrationTest
| lastStart = feature.getStart(); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Need at least one test case that uses -L <intervals>
|
Remainder of review comments addressed directly with no guff. |
|
OK. Passes checks. @droazen One final look? I couldn't figure out how to use runCommandLine, because I needed access to my walker instance. Is this OK, or is there a better way to do it? |
|
@droazen Can I get a thumbs up on this if I've addressed your suggestions adequately, or further input if not? Thanks. |
|
@tedsharpe I'll take a look now! |
droazen
left a comment
There was a problem hiding this comment.
@tedsharpe Engine changes look good now -- a few last easy comments, then go ahead and merge once they're addressed and tests pass (assuming of course that the changes in the sv package, which I didn't look at, have been fully reviewed)
| } | ||
| if ( newIdx <= lastIdx ) { | ||
| throw new UserException("Contig " + rec.getContig() + | ||
| " not in same order as in larger dictionary"); |
There was a problem hiding this comment.
Since these are exceptions that actual users are likely to trigger, it would be helpful to include the contents of both dictionaries in the error messages. You can do this by throwing a UserException.IncompatibleSequenceDictionaries and passing the two dictionaries in.
| public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase { | ||
|
|
||
| SAMSequenceDictionary dictionary; | ||
| final Set<String> samples = new TreeSet<>(); |
There was a problem hiding this comment.
Since there are accessor methods for these, can they be private? And relatedly, should the Set of samples be wrapped in a Collections.unmodifiableSet() after initialization to prevent downstream modification?
| * internal state as possible. | ||
| * | ||
| * @param feature Current Feature being processed. | ||
| * @param header Header object for the source from which the feature was drawn (may be null) |
There was a problem hiding this comment.
Add javadoc for the all-important ReadsContext and ReferenceContext args
| final ReadsContext readsContext, | ||
| final ReferenceContext referenceContext ) { | ||
| // We'll just keep track of the Features we see, in the order that we see them. | ||
| features.add(feature); |
There was a problem hiding this comment.
This is a little inconsistent with the rest of the example walkers, which all produce some diagnostic output showing the records that are processed by the traversal. Could you have this tool produce some textual output along the lines of ExampleFeatureWalker in addition to accumulating the Features in memory?
5ae634f to
0bf2c44
Compare
No description provided.