
Large Binaries


iandunlop


Hi,

 

We currently use git / bitbucket. The project is at a point where we need to look at solving the large binary file issue.

 

I see a lot of information about how awesome PlasticSCM is at handling large binaries. However, I can't find any explanation about how it handles them. 

 

How does PlasticSCM actually handle large binaries?

Does gitSync support your way of handling large binaries?

 

Thanks for your time.

 

Cheers,

   Ian Dunlop

 


Hi Ian,

 

The good thing about Plastic SCM and big files is that you don't have to care about how to handle them: you just run a checkin and that's all.

 

If you notice that the database is getting too big because you are uploading a lot of big files (game assets, textures...), you can archive those revisions to an external drive: https://www.plasticscm.com/documentation/administration/plastic-scm-version-control-administrator-guide.shtml#Chapter10:Archivingrevisions But this is only necessary when the server is running out of disk space.
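A rough sketch of the archive workflow, writing the spec and flags from memory (the file and paths are just placeholders, so double-check the admin guide above for the exact syntax on your version):

  # Move the content of an old revision of a big file out of the database to an external drive
  cm archive rev:Assets/level1.unity#cs:42 --file=/mnt/external/archived -c="free up server disk"
  # The metadata stays in the repository; the content can be brought back later with:
  cm archive rev:Assets/level1.unity#cs:42 --restore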

 

Regarding gitSync support for big binary files: we do support them, so pushing or pulling big files to Plastic shouldn't be a problem. I think Git is going to be the bottleneck in your case.


Regarding gitSync support for big binary files: we do support them, so pushing or pulling big files to Plastic shouldn't be a problem. I think Git is going to be the bottleneck in your case.

 

Thanks for the info. However, I still don't understand how Plastic handles them.

 

Can you explain the statement "I think Git is going to be the bottleneck in your case" further? Are you referring to the maximum repo size allowed on the Bitbucket hosting site or some other limit?

Does it only store the latest version of the file, i.e. does it not keep a history for them?

How does it know they are large binary files? Do I need to mark folders or files in a specific way when using gitsync?

Does it do this while using gitsync?

 

Thanks again!


Thanks for the info. However, I still don't understand how Plastic handles them.

 

What exactly do you want to know? How it's technically sent and stored inside the repo? Well, the file (big or not) is compressed and sent to the server in 4 MB chunks; the server then stores the file content and the changeset metadata inside the database.

 

Note: compression can optionally be disabled for certain file extensions when you know for sure the file can't be compressed any further. You will save time during the checkin operation since Plastic won't try to compress something that is already compressed.

 

Can you explain the statement "I think Git is going to be the bottleneck in your case" further? Are you referring to the maximum repo size allowed on the Bitbucket hosting site or some other limit?

 

Git is not good at handling (reading/writing) big files, so if you are exporting data from Git to Plastic, Git might have issues reading/processing the big files. The same might happen while pushing data from Plastic to Git: Git can have issues adding them to the Git repository.

 

 

Does it only store the latest version of the file, i.e. does it not keep a history for them?

 

Plastic will store all the revisions. Even if you archive the revision content of a file (cm archive), the metadata will remain in the repo, so the history command will continue working.

 

How does it know they are large binary files? Do I need to mark folders or files in a specific way when using gitsync?

 

Plastic is size-agnostic: you add a file and Plastic will store it.

The upload time will obviously be longer just because of the network transfer, but for Plastic it's just like processing any other file. There's no parameter you need to set during the gitsync operation.
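For reference, a sync invocation looks roughly like this, writing it from memory (the repository name, server and URL are placeholders; check the GitSync guide for the exact syntax on your version):

  # Sync a Plastic repository with a remote Git repository; big files need no special flags
  cm sync myrepo@localhost:8087 git https://bitbucket.org/yourteam/yourrepo.git --user=youruser --pwd=yourpassword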

 

Does it do this while using gitsync?

 

I don't understand what you are asking :P


Hi. I'm interested in this as well. What are the differences between Git and Plastic for large files? Why is Git unable to handle large files? If you are talking about the size limit on GitHub, Bitbucket or somewhere else, that limit can be raised on a self-hosted Git server, so I don't understand what Git doesn't do that Plastic does to handle big files. The only difference I see is that Plastic separates the workspace from the repository, so the workspace doesn't store all the revisions and pulling changes in centralized mode takes less time. But Git also seems able to pull only the latest revisions with a commit-count limit (a shallow clone).

 

After these days using Plastic, I can see it has stronger merging (Xmerge or SemanticMerge) than Git clients, but I still can't see the advantage for large files. It would be great if an admin could clarify the advantages for large files, because I'm developing game projects.


Ok, I'm more confused now than when I started this forum post. 

 

git currently handles my large files. The issue is that over time as the history grows the size of the git repo becomes too large for most hosting services. I could run my own git repo on a server without any hard limits on disk space. The repo would be huge by the end of the project (for sure), but if I'm running my own server then I don't really care about the size of the repo. Cloning it might be painful though...

 

There are git patches that support storing large binaries outside of your main repo. That seems to trade convenience of history for repo size. That sounds like something that could work. Just not sure what plastic brings to the table here.

 

It would be great to get an official explanation as to the advantages plastic has over git in terms of large binary files. I see the phrase / wording repeated many times throughout the site but still have no idea what it really means.


Hi guys!

 

First, I'll let Linus himself answer you about the issues with big files:

[...] CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

When you add big files (hundreds of MB each) you will notice that Git performance starts decreasing; maybe a "git gc" or "git repack" will save the day, or maybe not. I've been talking with several Git users developing video games and they told me that the performance was really a pain. As I said, this happens with big files; with regular files it works pretty well.
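(For reference, the usual Git housekeeping commands people reach for look like this; whether they actually help on a repo full of big binaries varies:)

  # Repack the object database and prune loose objects
  git gc --aggressive --prune=now
  # Or repack explicitly, capping memory per pack window so huge blobs don't exhaust RAM
  git repack -a -d --window-memory=100m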

 

I suffered it myself migrating big repositories into Git (700K files in the working copy / 12 GB): out-of-memory issues were a regular thing. If you didn't suffer that, great, your big files are not so big and Git is able to handle them, but I can tell you for sure that there are a lot of people even migrating back to SVN for this reason; yes, SVN is able to handle big files too :P.

 

Then you have Atlassian or GitHub trying to show you how to handle big files with Git.

http://blogs.atlassian.com/2014/05/handle-big-repositories-git/

 

Which basically consists of not committing them into the repository (git-annex) to work around the issue, or denying uploads greater than 50 MB. Do you really think it's because they want to save disk space (come on, Mega gives you 50 GB for free and Dropbox 1 TB for $10)? It's because they just want to avoid issues.
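(For anyone who hasn't seen the git-annex approach, a typical workflow sketch from memory; the file name and remote are just examples:)

  # git-annex keeps only a pointer/symlink in the Git history; the binary content lives outside the normal object store
  git annex init "my workstation"
  git annex add Assets/huge_texture.psd
  git commit -m "Add texture via annex"
  # The content itself is copied between annex remotes separately from git push
  git annex copy --to=origin Assets/huge_texture.psd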

 

What Plastic gives you is the chance to commit those files into the repository too, with no restrictions. It's not rocket science either: you want to commit a huge file? OK, we do it.

 

We store the revision data inside databases (MySQL, SQL Server, SQLite, PostgreSQL), which brings us benefits over the internal Git object store (.git): atomic operations by default, etc.

 

Having the repository and workspace separated isn't necessarily better for us in terms of performance, because an update or checkin/commit operation can be slower since we need to transfer the information to another place, while Git only has to put it inside ".git/". But Git relies on RAM while doing certain operations, which is a weak point when you deal with huge files; we don't, so we are stronger on that point.


Archived

This topic is now archived and is closed to further replies.
