1. Context
Twice in my career, I was given the task of comparing two large CSV files (Comma-Separated Values: text files corresponding either to an Excel export or to an export of database tables):
- In the first case, it was a comparison between trades entered in a customized software package in one country (P1) and trades entered in the same software package, customized differently, in another country (P2). The information first had to be transcoded, then the trades absent from the Paris system and those that had been modified had to be identified, in order to update the data in the Paris database. Each file was about 300 MB and the computers had about 1 GB of RAM. The development was expected to be done within 3 days at most. (The principle: we want the result for yesterday, with a run time under 8 hours.)
- In the second case, flow files had to be reconciled between two different software packages (the old one, due to be decommissioned, and its replacement) that ran side by side for non-regression testing. Each file was 2.5 GB and the goal was to compare them daily in less than 2 hours. Development needed to be operational in less than two weeks.
To be able to compare them, the first step (and the most complex one) consists of sorting the two files: once they are sorted, we can decrease both the number of lines read and the number of comparisons. Unfortunately, sorting a file requires loading it, and when the file is large that becomes complicated.
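Once the two files are sorted on the same key, the comparison itself becomes a single pass: each line is read at most once, and a difference is reported as soon as it is encountered. Here is a minimal sketch (the function name, the reported statuses and the ';' separator are my own assumptions, not code from the projects above):

def compare_sorted( filename1, filename2, get_key = lambda c: c ):
    """ Single-pass comparison of two CSV files already sorted on the same key.

    Yields ( status, columns ) tuples, where status is 'only_left',
    'only_right' or 'modified'. Minimal sketch: no header handling.
    """
    def read( file ):
        line = file.readline().rstrip()
        return line.split( ';' ) if line else None

    with open( filename1, 'r' ) as file1, open( filename2, 'r' ) as file2:
        columns1, columns2 = read( file1 ), read( file2 )
        while columns1 is not None or columns2 is not None:
            if columns2 is None or ( columns1 is not None and get_key( columns1 ) < get_key( columns2 ) ):
                yield ( 'only_left', columns1 )    # line missing on the right side
                columns1 = read( file1 )
            elif columns1 is None or get_key( columns2 ) < get_key( columns1 ):
                yield ( 'only_right', columns2 )   # line missing on the left side
                columns2 = read( file2 )
            else:
                if columns1 != columns2:
                    yield ( 'modified', columns1 ) # same key, different values
                columns1, columns2 = read( file1 ), read( file2 )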
2. Technical choices
2.1 Language
The choice of language strongly depends on the context:
- If you have sufficient development time, use a compiled language like C/C++ or a just-in-time compiled language like C#, which will give you excellent performance and fine-grained memory management.
- If your need requires a short development time, choose a scripting language such as Perl or Python.
In this post, we will use Python in order to obtain the most compact and readable code possible, so that you can make the method your own and then quickly transpose it into another language.
2.2 Algorithm
The choice of algorithm conditions the performance as well as the feasibility. Indeed, loading a 2.5 GB CSV file into RAM is simply not possible unless you have an exceptional memory capacity: experiment a little and you will find that memory consumption is almost 10 times the size of the file, because each line is split into a vector of values.
Here is a small example:
import psutil

def read_csv( filename ):
    """ Load a whole CSV file as a list of column lists (naive approach). """
    results = []
    with open( filename, 'r' ) as file:
        for line in file:
            line = line.rstrip()
            columns = line.split( ';' )
            results.append( columns )
    return results

path = "C:\\personnel\\python\\blog\\compare_csv"

before = psutil.virtual_memory()
results = read_csv( path + "\\sample1.csv" )
after = psutil.virtual_memory()
print( "psutil.size =", after.used - before.used )
With a CSV file of 39 MB, we get 341 MB in memory.
The solution is to use an external merge sort: we divide the large file into several files of the same size, sort them independently of each other, and rebuild the file by merging the sorted files. The details of the method can be found on the following sites:
- https://fr.wikipedia.org/wiki/Tri_fusion
- http://www.montefiore.ulg.ac.be/~pw/cours/psfiles/fichiers-transp9.pdf
3. Divide and conquer
The first part of the algorithm divides the main file into sub-files of equivalent size (we rely on a number of lines of the original file), which are sorted independently of each other. The file filename is cut into pieces written under the temporary_radix path pattern. Each piece contains rows_count rows and is sorted with the 'sort' method of lists, passing get_key (a function that extracts from the columns of the initial file a tuple serving as the sort key) as the key argument.
def split_file( filename, temporary_radix, rows_count, get_key ):
    """ Split file

    Split input file in several sorted files of rows_count lines.

    Parameters:
        filename: input full path name
        temporary_radix: full path radix to create split sorted files
        rows_count: number of rows per generated file
        get_key: the function to extract the sorting key from the columns

    Return:
        number of generated files
    """
    # local variables
    file_counter = 0

    # write one sorted sub-file
    def write_file( filename, rows ):
        rows.sort( key = get_key )
        with open( filename, 'w' ) as output_file:
            for r in rows:
                print( *r, sep = ';', end = '\n', file = output_file )

    # read the input file and cut it every rows_count lines
    with open( filename, "r" ) as input_file:
        rows = []
        for line in input_file:
            line = line.rstrip()
            columns = line.split( ';' )
            rows.append( columns )
            if len( rows ) == rows_count:
                write_file( temporary_radix % ( 0, file_counter ), rows )
                rows.clear()
                file_counter += 1
        if len( rows ) > 0:
            write_file( temporary_radix % ( 0, file_counter ), rows )
            file_counter += 1

    return file_counter
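For instance, splitting the sample file used at the end of this post into chunks of 10000 rows sorted on their second column could look like this (the paths are the same hypothetical ones as in the final example, and the temp directory must already exist):

get_key = lambda c: (c[1],)

# number of sorted chunk files written in the temp directory
count = split_file( "C:\\personnel\\python\\blog\\compare_csv\\sample1.csv",
                    "C:\\personnel\\python\\blog\\compare_csv\\temp\\temp_%d_%d.csv",
                    10000,
                    get_key )
print( count, "sorted chunk files created" )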
4. Merging in order to win
The second part of the algorithm consists in merging the files two by two while preserving the sort order. We read one line at a time from each file, compare them, write the smaller one to the result file and replace the line just written with the next one from the same source file. The risk is starvation (all the lines of one file have been consumed, but not those of the other), so the comparison function must be able to handle an exhausted side: the remaining, well-defined line is simply written to the output file.
from os import unlink

def fusion( temporary_radix, depth, id1, id2, new_id, get_key ):
    """ File fusion

    F(id1) + F(id2) -> F(new_id)

    Parameters:
        temporary_radix: full path radix to read/write temporary files
        depth: depth
        id1: id of 1st file (temporary_radix % ( depth, id1 ))
        id2: id of 2nd file (temporary_radix % ( depth, id2 ))
        new_id: id of the new file (temporary_radix % ( depth + 1, new_id ))
        get_key: function to extract the sorting key from the columns
    """
    # local variables
    columns1 = None
    columns2 = None
    name1 = temporary_radix % ( depth, id1 )
    name2 = temporary_radix % ( depth, id2 )
    name_o = temporary_radix % ( depth + 1, new_id )

    # compare the current lines of both files, handling an exhausted side
    def compare():
        if None == columns1:
            return 1
        elif None == columns2 or get_key( columns1 ) < get_key( columns2 ):
            return -1
        else:
            return 1

    # open the files
    with open( name1, "r" ) as inFile1, \
         open( name2, "r" ) as inFile2, \
         open( name_o, "w" ) as outFile:
        while True:
            # read the next line of a file once its current line has been consumed
            if not columns1:
                line = inFile1.readline().rstrip()
                if 0 != len( line ):
                    columns1 = line.split( ';' )
            if not columns2:
                line = inFile2.readline().rstrip()
                if 0 != len( line ):
                    columns2 = line.split( ';' )
            if not columns1 and not columns2:
                break
            # write the smaller line and mark it as consumed
            if compare() < 0:
                print( *columns1, sep = ';', end = '\n', file = outFile )
                columns1 = None
            else:
                print( *columns2, sep = ';', end = '\n', file = outFile )
                columns2 = None

    # finalize: remove the two merged input files
    unlink( name1 )
    unlink( name2 )
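As a side note on this design choice: Python's standard library also offers heapq.merge, which performs a lazy k-way merge of already sorted iterables. The pairwise fusion above keeps memory and file-handle usage strictly minimal, but a variant merging all the chunks in a single pass could be sketched as follows (an alternative of mine, not the method used in the rest of this post):

import heapq
from contextlib import ExitStack

def merge_sorted_files( sorted_filenames, output_filename, get_key = lambda c: c ):
    """ Merge already sorted chunk files in one pass with heapq.merge.

    Every chunk stays open, but only one line per chunk is held in memory.
    """
    def rows( file ):
        for line in file:
            line = line.rstrip()
            if line:
                yield line.split( ';' )

    with ExitStack() as stack, open( output_filename, 'w' ) as output_file:
        files = [ stack.enter_context( open( name, 'r' ) ) for name in sorted_filenames ]
        for columns in heapq.merge( *( rows( f ) for f in files ), key = get_key ):
            print( *columns, sep = ';', end = '\n', file = output_file )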
5. Sorting or Nothing
The purpose of this function is to orchestrate the split and then the merges until a single sorted file is obtained. Once the file is cut, we merge the sub-files two by two: at each step the number of files is divided by 2. We continue until only one file remains. When a step contains an odd number of files, one file is simply carried over from step i to step i + 1.
from os import rename

def sort_file( input_filename, \
               sorted_filename, \
               rows_count, \
               temporary_radix = "C:\\Temp\\temp_%d_%d.csv", \
               get_key = lambda c: c ):
    """ Sort file

    Parameters:
        input_filename: input file name (file to sort)
        sorted_filename: output file name (sorted file)
        rows_count: number of rows per file in the splitting phase
        temporary_radix: full path radix name for temporary files ({path}\\temp_{depth}_{id}.csv)
        get_key: key function to sort
    """
    # initialize: cut the input file into sorted chunks
    count = split_file( input_filename, temporary_radix, rows_count, get_key )
    depth = 0

    # main loop: merge the files 2 by 2 until only one sorted file remains
    while 1 < count:
        n = 0
        for i in range( 0, count, 2 ):
            j = i + 1
            if count == j:
                rename( temporary_radix % ( depth, i ), \
                        temporary_radix % ( depth + 1, n ) )
            else:
                fusion( temporary_radix, depth, i, j, n, get_key )
            n += 1
        depth += 1
        count = n

    # rename the last temporary file into the expected output file
    rename( temporary_radix % ( depth, 0 ), sorted_filename )
Here is an example:
get_key = lambda c: (c[1],)

sort_file( "C:\\personnel\\python\\blog\\compare_csv\\sample1.csv", \
           "C:\\personnel\\python\\blog\\compare_csv\\sample1_sorted.csv", \
           10000, \
           "C:\\personnel\\python\\blog\\compare_csv\\temp\\temp_%d_%d.csv", \
           get_key )
Conclusion
The method makes it possible to trade memory allocation and management problems for file accesses. Performance will therefore depend on 2 parameters:
- Disk access time: avoid using a network drive for the temporary files; prefer a local hard drive.
- The number of lines per chunk when cutting: the lower this number, the more files there are and therefore the more expensive the disk accesses become.
The algorithm must be calibrated for the machine that runs the operation (amount of memory and hard disk) and for the format of the CSV file, in order to choose an efficient number of lines for the split.
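To get a starting point for that calibration, the roughly 10x expansion factor measured earlier (a 39 MB file costing about 341 MB in memory) can be turned into a back-of-the-envelope estimate of rows_count from the memory budget you are willing to spend on each chunk. The helper below is only a sketch under that assumption; the name, the sampling size and the factor are mine and should be tuned on your own machine:

def estimate_rows_count( csv_filename, memory_budget_bytes, expansion_factor = 10, sample_lines = 1000 ):
    """ Back-of-the-envelope estimate of rows_count for split_file.

    Assumption: a row costs roughly expansion_factor times its size on disk
    once split into a list of column strings (the ~10x ratio observed above).
    """
    # measure the average line size on a small sample of the file
    total, count = 0, 0
    with open( csv_filename, 'r' ) as file:
        for line in file:
            total += len( line )
            count += 1
            if count >= sample_lines:
                break
    if count == 0:
        return 0
    average_line_size = total / count
    return int( memory_budget_bytes / ( average_line_size * expansion_factor ) )

For example, with a budget of 200 MB per chunk and 100-byte lines, this gives roughly 200000 rows per chunk.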