Buffer-based csv2tsv (performance improvement) #301
Merged
jondegenhardt merged 7 commits into eBay:master on Sep 6, 2020
Codecov Report
@@           Coverage Diff           @@
##           master    #301    +/-   ##
=======================================
  Coverage   99.34%  99.35%
=======================================
  Files          18      18
  Lines        6763    6792    +29
=======================================
+ Hits         6719    6748    +29
  Misses         44      44
This PR changes the algorithm used by `csv2tsv` to work on a buffer of data at a time rather than a character at a time. The motivation for this change is performance.

The main change is that the new version writes longer blocks of characters to the output stream, where the original version wrote a single byte at a time. The output stream itself uses buffering, but writing longer blocks to it is still faster. Also, at one point a change in either the D library or the compiler resulted in less optimal code and performance degraded. The exact cause is not clear.
The new algorithm still walks over the input CSV data one byte at a time. In this manner the algorithm is similar to the original. However, rather than immediately writing the byte to the output stream, the algorithm keeps track of the set of consecutive bytes that can be written unchanged. In addition, if the byte in the input data is simply being replaced by a different byte, the modification is done in place. The common case is the CSV field delimiters being replaced, e.g. a comma being replaced by a TAB. In this way longer sequences of bytes can be written to the output stream all at once. The current input region is written out whenever a sequence of consecutive bytes is interrupted. The common case is CSV fields surrounded by double quotes, which get removed when writing the TSV form.
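The approach described above can be illustrated with a minimal Python sketch. It is not the D code from this PR: the function name `csv_to_tsv`, its parameters, and the state names are hypothetical, and it assumes the whole input fits in one buffer and uses LF line endings. It walks the input one byte at a time, replaces single bytes in place, and writes the pending region only when a double quote interrupts it.

```python
import io

def csv_to_tsv(data, out, delim=b",", quote=b'"', tab_repl=b" ", nl_repl=b" "):
    """Illustrative sketch only (hypothetical names, single buffer, LF newlines)."""
    buf = bytearray(data)  # mutable copy so single bytes can be replaced in place
    COMMA, QUOTE, TAB, NL = delim[0], quote[0], 0x09, 0x0A
    FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)
    state = FIELD_START
    start = 0  # index of the first byte not yet written to the output

    for i, b in enumerate(buf):
        if state == QUOTED:
            if b == QUOTE:
                # Closing quote, or first half of an escaped quote ("").
                # Either way it is dropped, so flush the region before it.
                out.write(buf[start:i])
                start = i + 1
                state = QUOTE_IN_QUOTED
            elif b == TAB:
                buf[i] = tab_repl[0]  # TAB inside field data, replaced in place
            elif b == NL:
                buf[i] = nl_repl[0]   # newline inside field data, replaced in place
            continue
        if state == QUOTE_IN_QUOTED:
            if b == QUOTE:
                # Escaped quote: the first was dropped, this one stays in the region.
                state = QUOTED
                continue
            state = UNQUOTED  # the quoted field ended; handle b as unquoted below
        if b == QUOTE and state == FIELD_START:
            # Opening quote: flush what came before and skip the quote.
            out.write(buf[start:i])
            start = i + 1
            state = QUOTED
        elif b == COMMA:
            buf[i] = TAB  # field delimiter replaced in place, region keeps growing
            state = FIELD_START
        elif b == NL:
            state = FIELD_START
        elif b == TAB:
            buf[i] = tab_repl[0]
            state = UNQUOTED
        else:
            state = UNQUOTED

    out.write(buf[start:])  # flush whatever remains

out = io.BytesIO()
csv_to_tsv(b'a,"b,c",d\n"x ""y""",z\n', out)
print(out.getvalue())  # b'a\tb,c\td\nx "y"\tz\n'
```

On input without quotes the loop never flushes early, so the whole buffer goes out in a single write. The separate `tab_repl` and `nl_repl` parameters mirror the idea of distinct replacement characters for TAB and newline mentioned later in this description.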
Performance tests indicate the new algorithm is considerably faster than the original algorithm. Testing was done on a Mac Mini (16GB RAM, SSD drives). Compared to the current `csv2tsv` version 2.0.0, the new version ran 40% faster on files with significant amounts of CSV escapes (double quotes on every field), and 60% faster on files with limited CSV escapes. Versus `csv2tsv` version 1.1.19 (the 2018 benchmark study version), the new version is 10% faster on files with significant CSV escapes and 40% faster on files with limited CSV escapes.

In short, performance is improved significantly over all previous versions. On "simple" CSV data that does not contain CSV escapes, the performance is now in the ballpark of Unix `tr`, where `tr` is only being used to convert commas to TAB characters. GNU `tr` is still about 20% faster, but this is a good indication the new version of `csv2tsv` has solid performance. GNU `tr` is of course not checking for CSV escapes and should have better overall performance.

This PR also adds an option for using different replacement characters for TAB and newline characters found in the data. The previous version used the same replacement character for both. For now this is only a change to the internal code. It will be made available from the command line in a future PR.
A copy of the previous version was put in the directory `csv2tsv/src_v1` so that the original version can be found more easily.