HACKER Q&A
📣 codingclaws

How would you chunk a large Excel file?


Let's say you had an Excel file with 10,000 rows and you wanted to break it up into many Excel files, each with 500 records. Each new file should have the header fields from the original. How would you do it? I did it by writing a Node script, but I'm wondering if there's an easier way.

Edit: Guys, this is just an example. I'm looking for a general solution. It could be 10 million rows.


  👤 tanin Accepted Answer ✓
You can use my SQL spreadsheet app: https://superintendent.app

I had a similar problem at work where I needed to apply formulas to a 5 GB CSV file. Excel can't handle more than 1M rows, and working with a database through the command line is too clunky. I did try splitting the CSV into multiple files, but applying formulas across multiple files isn't easy. Eventually I built a desktop GUI wrapper around SQLite, and it grew into Superintendent.app (now powered by DuckDB).
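
For reference, the underlying "SQL over a CSV" idea can be sketched with plain DuckDB from Python. This is just the general technique, not how the app works internally; file names are illustrative:

    import duckdb

    con = duckdb.connect()
    # Count rows once, then export 500-row slices with COPY.
    # (DuckDB preserves insertion order by default, so the slices line up.)
    total = con.execute(
        "SELECT count(*) FROM read_csv_auto('big.csv')"
    ).fetchone()[0]
    chunk = 500
    for part, start in enumerate(range(0, total, chunk), start=1):
        con.execute(f"""
            COPY (SELECT * FROM read_csv_auto('big.csv')
                  LIMIT {chunk} OFFSET {start})
            TO 'chunk_{part}.csv' (HEADER, DELIMITER ',')
        """)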

The newest version supports "Ask AI", which can be used to ask your CSVs anything and to have AI write or rewrite SQL for you. It has been pretty fun to tell the AI to "Expand *", "format all dates to something like May 3, 2024", and so on.


👤 namrog84
Write a script. I know there are decent APIs for Excel files.

I don't know the specific API or recommended scripting language offhand. This would be a good task for ChatGPT or an equivalent, at least enough to get started.

edit: I asked ChatGPT; it recommended Python and pandas for interacting with Excel:

    import pandas as pd
    # Load the data from an Excel file, assuming headers are in the first row by default
    data = pd.read_excel('path/to/your/file.xlsx')
    # Define the number of records per chunk
    chunk_size = 500
    # Split the data into chunks and write each chunk to a new Excel file
    for i in range(0, len(data), chunk_size):
        # Extract the chunk of data based on the current index range
        chunk = data.iloc[i:i + chunk_size]
        # Write the chunk to a new Excel file, including headers as column names
        chunk.to_excel(f'output_{i // chunk_size + 1}.xlsx', index=False)
I asked about the header row, and it claims pandas includes it in each chunk, but I haven't verified that. It's at least a place to start iterating from; you'd want to confirm with real code and tests.

👤 orhmeh09
I would use R. Phoneposting now but something like

  library(tidyverse)
  library(readxl)
  library(writexl)
  read_excel("file.xlsx") %>%
    group_by(group_id = (row_number() - 1) %/% 500) %>%
    group_walk(~ write_xlsx(.x, paste0("file_", .y$group_id, ".xlsx")))

edit: updated to write xlsx instead of csv

👤 balnaphone
I would save my data in CSV format, then use this. Save the code below as chunk.pl (remove the leading spaces) and run it as "perl chunk.pl input.csv"; each chunk gets a copy of the header row:

    #!/usr/bin/perl -CSD -w -Mstrict -Mwarnings -MText::CSV
    
    # chunk.pl -- split a CSV file into chunks, repeating the header in each
    
    # Usage message and exit if needed
    if (!@ARGV || $ARGV[0] eq '-h') {
        print "Usage: $0 input_csv [chunk_size] [output_filename_format] [separator]\n";
        print "Example: $0 input.csv 500 'input-%08d.csv' ','\n";
        exit;
    }
    
    # Set command-line arguments
    my ($INFILE, $CHUNKSIZE, $FMT, $SEP) = @ARGV;
    $CHUNKSIZE //= 500;
    $FMT //= "data-%08d.csv";
    $SEP //= ",";
    
    # Initialize CSV, file handles, and counters
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, sep_char => $SEP, eol => "\n" });
    my ($i, $f, $out) = (0, 1, undef);
    open my $in, "<:encoding(UTF-8)", $INFILE or die "Cannot open $INFILE: $!";
    
    # Read the header row once so it can be repeated at the top of every chunk
    my $header = $csv->getline($in) or die "Cannot read header from $INFILE";
    
    # Main loop
    while (my $row = $csv->getline($in)) {
        if ($i % $CHUNKSIZE == 0) {
            close $out if defined $out;
            open $out, ">:encoding(UTF-8)", sprintf($FMT, $f++) or die "Cannot open output file: $!";
            $csv->print($out, $header) or die "Failed to write header: $!";
        }
        $csv->print($out, $row) or die "Failed to write row: $!";
        $i++;
    }
    
    # Clean up: close file handles
    close $out if defined $out;
    close $in;

👤 sancarn
Untested but something like this in VBA should suffice:

    Dim rng As Range: Set rng = Sheet1.UsedRange
    Dim rowCount As Long: rowCount = rng.Rows.Count
    Dim colCount As Long: colCount = rng.Columns.Count
    Const SIZE As Long = 500
    Dim i As Long, n As Long
    For i = 2 To rowCount Step SIZE
      n = SIZE
      If i + n - 1 > rowCount Then n = rowCount - i + 1 ' trim the final chunk
      Dim wb As Workbook: Set wb = Workbooks.Add()
      wb.Sheets(1).Range("A1").Resize(1, colCount).Value = rng.Resize(1).Value ' header row
      wb.Sheets(1).Range("A2").Resize(n, colCount).Value = rng.Offset(i - 1).Resize(n).Value ' data chunk
      wb.SaveAs "C:\Temp\" & i & ".xlsx", FileFormat:=xlOpenXMLWorkbook
      wb.Close SaveChanges:=False
    Next

👤 kstrauser
Can you confirm that said file is purely static data? Chunking a file like that is one thing. Handling one with cell references is a different animal.

👤 tetris11
I would naturally do it in Emacs Lisp for its superior Excel reading abilities

    (require 'csv-mode)
    
    (let ((input-file "bigfile.xls")
          (output-dir "chunked/")
          (chunk-size 1000))
      (cl-labels ((read-csv-file (filename)
                    (with-temp-buffer
                      (insert-file-contents filename)
                      (csv-mode)
                      (csv-parse-buffer t)))
                  (write-csv-file (filename data)
                    (with-temp-buffer
                      (csv-mode)
                      (dolist (row data)
                        (csv-insert-list row))
                      (write-region (point-min) (point-max) filename)))
                  (chunk-data (data chunk-size)
                    (let (result)
                      (while data
                        (push (cl-subseq data 0 (min chunk-size (length data))) result)
                        (setq data (nthcdr chunk-size data)))
                      (nreverse result))))
    
        (let* ((data (read-csv-file input-file))
               (chunks (chunk-data data chunk-size))
               (file-count 1))
          (dolist (chunk chunks)
            (let ((output-file (expand-file-name (format "chunk-%04d.csv" file-count) output-dir)))
              (write-csv-file output-file chunk)
              (setq file-count (1+ file-count))))
          (message "Processing complete. %d files created." (1- file-count)))))
    

(This is a joke. Do not use this.)

👤 arh68
The C# interop API will surely work, but I wouldn't claim it's user-friendly.

I would strongly consider dumping rows into SQL, for more "natural" selection.

[1] https://learn.microsoft.com/en-us/dotnet/api/microsoft.offic...


👤 tomrod
For chunking something that lives in Excel, Python is a good fit, and Go or Rust also work well in these situations. ChatGPT is a good starting point to build out some boilerplate.

For readable (non-binary) files, there are standard tools like split, awk, etc.


👤 qup
That's only 20 cut-and-paste operations? I would certainly do that manually before trying to code it, and I'm a proficient coder, not a proficient Excel user.

If it had 100k rows, I'd be out of my depth, so I'd hit Google.


👤 binarymax
I’m not at my machine to whip up an example, but this is an ideal use case for pandas. You can read or stream the Excel file and split it in probably fewer than 20 lines of Python.
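
If the file is too big to load at once, pandas can stream with read_csv's chunksize. A sketch, assuming the sheet is exported to CSV first (read_excel has no chunked mode); file names are illustrative:

    import pandas as pd

    # Stream the file 500 rows at a time so only one chunk is ever in memory.
    for part, chunk in enumerate(pd.read_csv('data.csv', chunksize=500), start=1):
        chunk.to_csv(f'output_{part}.csv', index=False)  # header repeats per file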

👤 BugsJustFindMe
> Guys this is just an example. I'm looking for a general solution. It could be 10 million rows.

It can't be. Excel tops out at 1,048,576 rows per sheet.


👤 nutrie
It's been well over a decade since I last dealt with Excel, but I remember you could actually query the data with SQL without opening the file, like you would with any flat-file DB, if size is the problem. It was poorly documented, but I'd done it a few times and it worked really well. The best part was that it was simple, fast, and worked even with locked files. Otherwise I don't understand the question.
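
That trick was the ODBC route, and it still exists. A hedged sketch of what it might look like from Python with pyodbc, assuming Windows with the Microsoft Excel ODBC driver installed; the path and sheet name are illustrative:

    import pyodbc

    # A sheet is addressed as a table named [SheetName$]
    conn = pyodbc.connect(
        r"Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
        r"DBQ=C:\data\big.xlsx;"
    )
    for row in conn.execute("SELECT * FROM [Sheet1$]"):
        print(row)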

👤 j45
How you split millions of rows can be very different from splitting shorter files, especially when you run into file input/output issues in different languages.

Opening and working on an Excel file with a few million rows can need more RAM than anticipated, depending on its size and complexity.

The quickest way I'd start with: convert it to a CSV, read the file in, and rewrite it out 500 lines at a time.
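
A minimal sketch of that approach in Python, assuming a single header row and no quoted fields containing newlines (use the csv module otherwise); file names are illustrative:

    CHUNK = 500
    part, out = 0, None
    with open('data.csv') as src:
        header = src.readline()  # keep the header to repeat it in every part
        for n, line in enumerate(src):
            if n % CHUNK == 0:
                if out:
                    out.close()
                part += 1
                out = open(f'part_{part:04d}.csv', 'w')
                out.write(header)
            out.write(line)
    if out:
        out.close()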


👤 klyrs
Not knowing what the data is like, I'd save it in a comma- or tab-separated text format. From there, it's just a few lines of bash.

👤 SPBS
I don't understand what you mean by an easier way, because it's always going to be a script, unless you mean some low-code/no-code tool already tailor-made to do this.

If you're worried about the data not fitting in memory, then stream the file. It seems like the Java API has support for this, and surely other languages do too.
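
In Python, for instance, openpyxl can stream in both directions: a read-only workbook iterates rows lazily, and write-only workbooks write them back out. A sketch (untested; file names are illustrative):

    from openpyxl import Workbook, load_workbook

    CHUNK = 500
    rows = load_workbook('big.xlsx', read_only=True).active.iter_rows(values_only=True)
    header = next(rows)

    wb, ws, part = None, None, 0
    for n, row in enumerate(rows):
        if n % CHUNK == 0:
            if wb:
                wb.save(f'part_{part:04d}.xlsx')
            part += 1
            wb = Workbook(write_only=True)  # write-only mode streams rows to disk
            ws = wb.create_sheet()
            ws.append(header)
        ws.append(row)
    if wb:
        wb.save(f'part_{part:04d}.xlsx')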


👤 beefnugs
Does each chunk have to execute properly? Do you have to minimize chunk size subject to the working inter-cell references? Can you split huge chunks that have too many references by inserting an intermediate chunk with stage-based value propagation?

This turns into a massive graph-theory nightmare if there are lots of references going on.


👤 bena
High level concept:

Xlsx or xls?

If it’s xlsx, stream the file, chunking 500 rows into each new file, watching for the header row at ref A1 and injecting it into each new file.

The xlsx format is a zip file of XML, so it's not so bad to work with once you start.
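
You can see that structure with any zip library; for example, in Python:

    import zipfile

    # An .xlsx is just a zip archive of XML parts
    with zipfile.ZipFile('big.xlsx') as z:
        print(z.namelist())  # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml, ...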

I should note, this works best if you don’t care about the order of the rows. They technically aren’t required to be in order (I think).



👤 MattGaiser
Seems like a 30 second LLM prompt for a Python script.

👤 SuperNinKenDo
With VBA? VBA has everything you need without any conversion or external libraries. Is there some reason you don't want to use it?

👤 allsummer
This is so quaint. I love it.

Not even going to ask AI. Waiting for a bash one-liner before OP reminds us they're on a corporate Windows machine.


👤 hgyjnbdet
A VBA macro in the original Excel file.

👤 guskel
Can wall-clock time be proportional to file size, or is a parallelized approach needed?

👤 energyguy78
Don't put it in Excel.

👤 gidorah
It's Excel, so just write a macro?

👤 brudgers
Why would you want to break it up?

If the input is not a spreadsheet, use a tool appropriate to the scale of the data store. Provide an API for access from Excel if end users need Excel.

If it is already an Excel file, Excel provides end users with access from Excel. If it is likely to outgrow Excel, use a tool appropriate to the scale of the data, and again provide an API for access from Excel if end users need Excel.

If it-is-Excel is your problem, breaking it up into multiple Excels is slicing the magic broom.

Good luck.


👤 secwang
Convert to CSV; then it's simple text processing. Finally, convert the CSV back to Excel.

👤 bigyikes
Not to be that guy, but this is the kind of task that LLMs are great at assisting with.

👤 nolongerthere
This is a classic XY problem [0]. Excel is almost never the answer when you're dealing with "big" data; you're almost always better off getting the data into a CSV or DB format and working on it from your favorite scripting language.

[0] https://xyproblem.info/


👤 mattmanser
Bet it took you more time to write that code than if you'd done it manually.

If you actually had to do it 'properly', there are a ton of options:

  - do it old school with a VBA macro
  - use the newer JS macro stuff
  - xlsx files are just a zip file of XML, so you could do it in pretty much any language