The app finds rows in a CSV file that contain duplicate values in a specified column.
Algorithm
Concerns
File size
If the file is small => scan the data once, counting occurrences of the target column values in a hashmap, and output every row whose key has been seen before (see the sketch below). Time complexity O(n), space complexity O(n).
If the file is large => its contents may not fit in memory => external merge sort can be used to solve this problem.
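A minimal sketch of the small-file case, assuming comma-separated rows without quoted fields; the class name and argument handling are illustrative, not the app's actual code. A HashSet is enough here because we only need to know whether a key was seen before, not its full count:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Print every row whose value in the given column has been seen before.
// O(n) time, O(n) space.
public class SmallFileDuplicates {
    public static void main(String[] args) throws IOException {
        String path = args[0];                  // CSV file path
        int column = Integer.parseInt(args[1]); // zero-based key column index

        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String row;
            while ((row = reader.readLine()) != null) {
                String key = row.split(",")[column]; // naive split; real CSV parsing would handle quoted fields
                if (!seen.add(key)) {                // add() returns false if the key was already present
                    System.out.println(row);
                }
            }
        }
    }
}
```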
Nature of input data
If the values of the key column are unique with high probability (e.g. GUIDs) => the hashmap-based algorithm stores a key for every value, so its memory footprint grows to roughly the size of the whole input file.
If the contents of the key column are known to be limited to some known set of possible values => then even for large files that do not fit in memory, we can use the hashmap, since its size is bounded by the number of distinct keys rather than by the file size.
External merge sort
Step 1. Divide the file into chunks that fit into memory (the chunk size can be adjusted).
Step 2. Sort each chunk and save the sorted result into a temp file.
Sorting can be implemented with any well-known algorithm. I used a SortedHashMap: inserting an entry costs O(log n), with O(n) extra space. A sketch of Steps 1-2 follows.
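A sketch of Steps 1-2 under the same CSV assumptions as above; here java.util.TreeMap plays the role of the SortedHashMap mentioned in the text, and the class and method names are illustrative:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Split the input into chunks of chunkSize rows, sort each chunk by the
// key column, and write each sorted chunk to its own temp file.
public class ChunkSorter {
    static List<Path> splitAndSort(String csvPath, int column, int chunkSize) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            TreeMap<String, List<String>> sorted = new TreeMap<>(); // O(log n) per insert
            String row;
            int count = 0;
            while ((row = reader.readLine()) != null) {
                String key = row.split(",")[column];
                sorted.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
                if (++count == chunkSize) {         // chunk is full: flush it
                    chunkFiles.add(writeChunk(sorted));
                    sorted.clear();
                    count = 0;
                }
            }
            if (!sorted.isEmpty()) {
                chunkFiles.add(writeChunk(sorted)); // last, partial chunk
            }
        }
        return chunkFiles;
    }

    static Path writeChunk(TreeMap<String, List<String>> sorted) throws IOException {
        Path tmp = Files.createTempFile("chunk", ".csv");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(tmp))) {
            for (List<String> rows : sorted.values()) {
                rows.forEach(out::println);         // keys iterate in sorted order
            }
        }
        return tmp;
    }
}
```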
Step 3. Read a limited piece of data from each chunk file. In my implementation each chunk has an associated queue, and data is read into these queues.
Step 4. Compare the first values of the queues and find the queue whose head is the minimum; this is the target value.
Step 5. Starting from this queue, push elements to the output while their values equal the target value. When a queue runs out of elements, read a new piece of data from its chunk file or move on to the next queue.
Step 6. When no queue has any values left to output (i.e. equal to the target value), go back to Step 4. A sketch of Steps 3-6 follows.
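A sketch of Steps 3-6 under the same assumptions (names such as ChunkMerger, BATCH, and the refill helper are illustrative, not the app's actual code). Because the chunk files are sorted, rows with equal keys come out of the merge consecutively, so a run longer than one row marks a duplicated key:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// One queue per sorted chunk file, refilled in fixed-size batches.
// Repeatedly pick the smallest head key (the "target value") and drain
// every row equal to it from all queues.
public class ChunkMerger {
    static final int BATCH = 1000; // rows read from a chunk file per refill

    static void merge(List<Path> chunkFiles, int column) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        List<Deque<String>> queues = new ArrayList<>();
        for (Path p : chunkFiles) {
            BufferedReader r = Files.newBufferedReader(p);
            readers.add(r);
            Deque<String> q = new ArrayDeque<>();
            refill(r, q);
            queues.add(q);
        }
        while (true) {
            // Step 4: find the minimum head key across all non-empty queues.
            // (A queue is only ever empty here if its file is exhausted.)
            String target = null;
            for (Deque<String> q : queues) {
                if (!q.isEmpty()) {
                    String key = key(q.peekFirst(), column);
                    if (target == null || key.compareTo(target) < 0) target = key;
                }
            }
            if (target == null) break; // all queues and files are exhausted

            // Steps 5-6: drain rows equal to the target from every queue.
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < queues.size(); i++) {
                Deque<String> q = queues.get(i);
                while (true) {
                    if (q.isEmpty() && !refill(readers.get(i), q)) break;
                    if (!key(q.peekFirst(), column).equals(target)) break;
                    rows.add(q.pollFirst());
                }
            }
            if (rows.size() > 1) rows.forEach(System.out::println); // duplicated key: print all its rows
        }
        for (BufferedReader r : readers) r.close();
    }

    static String key(String row, int column) {
        return row.split(",")[column];
    }

    // Reads up to BATCH rows into the queue; returns false at end of file.
    static boolean refill(BufferedReader r, Deque<String> q) throws IOException {
        String row;
        int n = 0;
        while (n < BATCH && (row = r.readLine()) != null) {
            q.addLast(row);
            n++;
        }
        return n > 0;
    }
}
```

A priority queue over the queue heads would reduce the minimum search in Step 4 from O(k) to O(log k) for k chunks; the linear scan is kept here to mirror the steps as described.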