Re-approaching the Project Euler Problems: Dealing with large files

Happy new year everyone! Welcome to 2022 and let’s hope it’s at least a bit better than the last couple of years have turned out to be.

Over the break I was tinkering with my Project Euler repo, and ran into a problem that part of me always suspected I would hit eventually: one of my files (either a results CSV or an expected answers JSON) getting too big, and GitHub saying “no, you can’t host that here”. I always saw this as an “eventually” issue though, rather than a “during Christmas 2021” issue.

Whilst starting work on problem 2, I noticed that the numbers involved would be considerably larger, especially as the problem itself expects a default input of 4 million rather than the 10,000 in problem 1. So I got to work as I had done with problem 1: manually calculating the results for inputs up to 40, checking my Python script against those, and then trusting it to generate answers all the way up to 4 million. All good, although I must confess it took a while!
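For context, problem 2 asks for the sum of the even-valued Fibonacci terms that do not exceed a given limit. My actual script generates answers for a whole range of inputs, but the core calculation boils down to something like this (a minimal sketch, with even_fibonacci_sum just an illustrative name):

def even_fibonacci_sum(limit: int) -> int:
    """Sum the even-valued Fibonacci terms that do not exceed limit."""
    total, a, b = 0, 1, 2
    while a <= limit:
        if a % 2 == 0:
            total += a
        a, b = b, a + b
    return total

# Sanity-check against a hand-calculated value before trusting larger inputs:
# the even Fibonacci terms up to 40 are 2, 8 and 34, which sum to 44.
assert even_fibonacci_sum(40) == 44
print(even_fibonacci_sum(4_000_000))  # the problem's default input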

Now to do some other bits and pieces before retiring for the evening, and then to git push:

remote: Resolving deltas: 100% (24/24), completed with 7 local objects.
remote: error: Trace: aa212c3521a5fdbf4c114882235a794bf0c397722cee81565295fe45a1c5e3d3
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File problem_2/problem_2_expected_answers.json is 222.32 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
To https://github.com/gavinsykes/project-euler.git
! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/gavinsykes/project-euler.git'

Yikes.

There is quite an alarming amount of red there, by which I mean there is any red at all. And that isn’t me just highlighting bits red for emphasis; that is git itself printing red characters to the terminal.

Luckily, having taken a look at the aforementioned Git LFS, it seems really quite simple to use: just tell it which files you expect to be larger than 100MB and it will sort them all out for you.

brew install git-lfs
git lfs install
git lfs track "*.csv"
git lfs track "*_expected_answers.json"

This should create a .gitattributes file with the following content:

*.csv filter=lfs diff=lfs merge=lfs -text
*_expected_answers.json filter=lfs diff=lfs merge=lfs -text
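The filter, diff and merge attributes tell Git to pass matching files through LFS, so the repo itself only stores a small pointer while the real content lives in LFS storage; the -text flag stops Git treating them as text files and normalising their line endings.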

But there was still a problem: I had committed the large file (which I suspected was the expected_answers.json file for problem 2) somewhere within the last 13 commits, before having installed LFS. So even though installing LFS brought up the files I asked it to track, ready to be recommitted, my history still contained a commit with the large file untracked by LFS, and GitHub still didn’t want to know.
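If you’re not sure which commits the offending file appears in, Git can tell you; assuming the path from the error message above:

git log --oneline -- problem_2/problem_2_expected_answers.json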

So how do I manage this? I believe I have found the solution.

Run git status and it should tell you that Your branch is ahead of 'origin/master' by 13 commits. (Your number of commits may vary.)

Delete the suspected offending file(s) on your local machine and commit the deletion.

Reset back the relevant number of commits, which should now be 14 (in my case it was 15 because I decided to tweak some other scripts in the middle of doing this, but don’t do that, why would you do that? Why would you make it more complicated than it needs to be unless you’re an idiot like me?)

git reset --soft HEAD~15

If you’re in VSCode, you should see all the changes you made within the last x commits reappear in your staged changes. We can now “squash” them into a single commit, and this one commit should push to remote without a problem (the full sequence is sketched below).
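Putting the whole recovery together, it looks something like this (using my file path and commit count; yours will differ):

git status    # "Your branch is ahead of 'origin/master' by 13 commits."
git rm problem_2/problem_2_expected_answers.json
git commit -m "Remove oversized expected answers file"
git reset --soft HEAD~15    # 13 original commits + the deletion commit, +1 extra tweak in my case
git commit -m "Squash the last 15 commits into one"
git push

Because the large file was both added and deleted within the commits being squashed, it never appears in the single commit that actually gets pushed, which is why GitHub stops complaining.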

Now for the moment of truth. LFS is now all set up and appears to have been working on the current (not too big, yet) JSON and CSV files, so let’s try it on the problem 2 expected answers JSON!

Uploading LFS objects: 100% (1/1), 190 MB | 1.3 MB/s, done.
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 12 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 463 bytes | 463.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/gavinsykes/project-euler.git
0e78144..9ac5c2d master -> master

So, other than the remarkably low upload speed of 1.3MB/s (my router isn’t the greatest and I’m not exactly close to it), I think we can call that a success! 😁😁