
Fast Export failing to deal with non-english chars in branch names


cidico


Hi guys!

I just ran a fast-export to avoid creating a lot of replication files, and I noticed that the fast-export tool didn't handle some of my branch names very well.

They contain chars like "ã", "ç" and "õ".

When I ran the fast-import command, the branch names came out with "??" instead of "cã", "çõ", etc.

Just to clarify, I'm in Brazil and we use a lot of those chars here. I believe that in Spain you have "strange" chars too... :)


Well, I ran some tests and found a "big" problem.

I just created a replication package from one branch with the wrong name, and Plastic "duplicated" the branch.

It now shows both branches, one with the proper chars and the other with the "??" chars.

Luckily, I did it with only one replication package file. I guess that using the fast-export and fast-import commands could have replicated all of the wrongly named branches.

Is there a "dark way" to delete this branch? The normal way says that I can't because it has revisions...


  • 1 month later...

Hello guys! :)

I need to give you an update here about special chars again...

It seems that the problem is happening with usernames too.

At home, my user is: Plácido Bisneto

I just imported my code from home and my username appears as: Pl?cido Bisneto...

Just updating! :)


Hi!

It seems that in version 4.0.239.0 the problem with special chars in branch names is still here. :(

My project isn't as big as others but, unfortunately, I can't change the branch names right now...

Because of those mistakes that my team and I made, I can't use the fast-export / fast-import feature. :( :( :( :( :( :(

Since fast-import duplicates the branches that have special chars, replacing chars like "ã", "ç" and "õ" with "??", I guess using it would get me into a lot of trouble when fast-exporting / importing incrementally.


  • 3 months later...

Hello my fellow friends from Spain!

I know, I'm being an "ant at a picnic" about this, but it seems that this particular bug was not fixed in 4.1 :(

Even though it isn't an "urgent" bug, have you at least scheduled the fix?

I do have a workaround to deal with this, but I feel kinda "dirty" or like I'm "cheating" when I use it.

Please forgive my mistake of using such filthy chars! hehehe :)


Hi, I described a very similar problem here:

http://www.plasticscm.net/index.php?/topic/930-importing-tfs-project-into-plastic/

I tried to use different encodings for the "author" and "committer" tags in the fast-export using a Perl script, but special characters always appear as "?" in the changeset "Created by" column in the Plastic GUI. With ISO-8859-1 encoding my name shows up as "S?ren"; with UTF-8 it shows up as "S??ren".

I double-checked the encoding in an encoding-aware text editor.

For the record I used 4.0.239.24


So it seems that I'm not the only one affected by this problem.

I really don't know what the encoding is. I suppose it's UTF-8, since Plastic itself is generating the file.

One thing I noticed a long time ago, when I first found this issue:

When exporting, Plastic does show all the branch names correctly in cmd.

The error seems to happen only when importing.

Is it the same for you?


I am importing from a fast-export made with Git. The Git repository comes from a git-tfs export from Microsoft TFS.

The fast-export tag that Plastic uses to register "Created by" is either "author" or "committer"; both have the same value in all cases for me.

These are the tags that I have tried re-encoding with different encodings using a simple Perl script.
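For anyone who wants to repeat that experiment, here is a minimal Python sketch of the same idea. It is not the Perl script mentioned above, and the function name `transcode_identities` is made up for illustration. It assumes the stream only uses the counted `data <count>` form (which is what git fast-export writes) and that the identity lines are currently in the `from_enc` encoding. As reported in this thread, re-encoding alone did not make the names display correctly in Plastic, so treat it purely as a way to test different encodings.

```python
import io

IDENT_PREFIXES = (b"author ", b"committer ")


def transcode_identities(src, dst, from_enc="iso-8859-1", to_enc="utf-8"):
    """Copy a fast-export stream, re-encoding only author/committer lines.

    src and dst are binary file-like objects. Payloads that follow a
    "data <count>" line are copied byte for byte, so file contents and
    commit messages are never touched (and their byte counts stay valid).
    """
    while True:
        line = src.readline()
        if not line:
            break  # end of stream
        if line.startswith(IDENT_PREFIXES):
            line = line.decode(from_enc).encode(to_enc)
        dst.write(line)
        # Assumption: only the counted form "data <count>" is present,
        # not the delimited "data <<EOF" form.
        if line.startswith(b"data ") and not line.startswith(b"data <<"):
            count = int(line[5:].decode("ascii"))
            dst.write(src.read(count))


# Small in-memory demonstration with a Latin-1 encoded name.
sample = (
    b"commit refs/heads/master\n"
    b"author S\xf8ren <s@example.com> 0 +0000\n"
    b"committer S\xf8ren <s@example.com> 0 +0000\n"
    b"data 9\n"
    b"author \xf8\n"  # 9 payload bytes that must stay untouched
    b"\n"
)
converted = io.BytesIO()
transcode_identities(io.BytesIO(sample), converted)
```

For a real export you would open the input and output files in binary mode ("rb" / "wb") and pass them in instead of the in-memory buffers.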


I had a similar problem to Soho's when importing from VSS -> Git -> Plastic. My file paths with UTF-8 characters were exported as octal escape sequences (\235 etc.) and spaces were not being quoted correctly. In the end I wrote a utility to fix it up.
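To illustrate what such a fix-up involves, here is a rough Python sketch that turns a quoted, octal-escaped path back into readable text. It is not the utility mentioned above. It assumes the path uses git's C-style quoting (wrapped in double quotes, non-ASCII bytes written as three-digit octal escapes) and that the underlying bytes are UTF-8; `unquote_path` is a hypothetical helper name.

```python
import re

# Single-character escapes used by C-style quoting.
_SIMPLE = {
    b"a": b"\a", b"b": b"\b", b"f": b"\f", b"n": b"\n",
    b"r": b"\r", b"t": b"\t", b"v": b"\v",
    b'"': b'"', b"\\": b"\\",
}

# A backslash followed by either a three-digit octal byte value
# or one other character.
_ESCAPE = re.compile(rb"\\([0-3][0-7]{2}|.)", re.DOTALL)


def _replace(match):
    esc = match.group(1)
    if len(esc) == 3:
        # Three octal digits encode a single raw byte.
        return bytes([int(esc.decode("ascii"), 8)])
    # Known escapes map to their byte; unknown ones keep the character.
    return _SIMPLE.get(esc, esc)


def unquote_path(quoted: bytes) -> str:
    """Decode a C-style quoted path; unquoted paths are returned as-is."""
    if len(quoted) >= 2 and quoted.startswith(b'"') and quoted.endswith(b'"'):
        raw = _ESCAPE.sub(_replace, quoted[1:-1])
    else:
        raw = quoted
    # Assumption: the path bytes are UTF-8.
    return raw.decode("utf-8")


path = unquote_path(b'"dir/s\\303\\270ren file.txt"')
```

Here `path` ends up as `dir/søren file.txt`, because `\303\270` are the octal forms of the UTF-8 bytes c3 b8.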

I'm not sure why your user names are coming out weird though. Have you looked at the binary dump of the data? Is it actually UTF8 or some other encoding?


Yes, it is UTF8. I have checked.

If I look at the git fast-export with a hex editor, the Nordic o-slash (ø) is encoded as "c3 b8", which is the correct UTF-8 encoding. This appears as ?? in Plastic. I have also tried ISO-8859-1, with the same result (only one ? though).
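The byte patterns described here are easy to reproduce. The Python sketch below shows one plausible way the "?" and "??" could arise: a reader that treats every byte above 0x7F as unknown. Whether the Plastic importer actually works like this is an assumption and is not confirmed anywhere in this thread; `show_as_ascii` and `hex_dump` are made-up helper names.

```python
def show_as_ascii(raw: bytes) -> str:
    """Render bytes as 7-bit ASCII, replacing every non-ASCII byte with '?'."""
    return "".join(chr(b) if b < 0x80 else "?" for b in raw)


def hex_dump(raw: bytes) -> str:
    """Space-separated hex, similar to what a hex editor shows."""
    return " ".join(f"{b:02x}" for b in raw)


name = "Søren"

utf8_bytes = name.encode("utf-8")         # ø becomes the two bytes c3 b8
latin1_bytes = name.encode("iso-8859-1")  # ø becomes the single byte f8

print(hex_dump(utf8_bytes))         # 53 c3 b8 72 65 6e
print(hex_dump(latin1_bytes))       # 53 f8 72 65 6e
print(show_as_ascii(utf8_bytes))    # S??ren
print(show_as_ascii(latin1_bytes))  # S?ren
```

The output matches the symptoms: two question marks for the two-byte UTF-8 sequence and one for the single ISO-8859-1 byte, which would be consistent with the bytes arriving intact and being lost at decoding time.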


Hi

I had some issues with UTF-8 encoding in some software I developed a few weeks ago. UTF-8 does not require the 3-byte BOM (Byte Order Mark) at the start of a text file, but when it is omitted applications have to guess the encoding. To check, look at the first bytes of the file and see whether they are 0xEF, 0xBB, 0xBF (see http://en.wikipedia.org/wiki/Byte_order_mark).

If the BOM is omitted and the reader assumes a one-byte encoding, UTF-8 basically looks like a one-byte encoding with an occasional two-byte code (hence the two ? marks where you expect a single character).
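The check for those three bytes can be scripted. This is a minimal Python sketch; the function names and the file name in the usage note are hypothetical.

```python
import codecs


def starts_with_utf8_bom(head: bytes) -> bool:
    """True if the given bytes begin with the UTF-8 BOM (EF BB BF)."""
    return head.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path: str) -> bool:
    """Read only the first three bytes of a file and test them for the BOM."""
    with open(path, "rb") as f:
        return starts_with_utf8_bom(f.read(3))
```

Calling `file_has_utf8_bom("export.fe")` on an export file (the name is just an example) only reads three bytes, so it is cheap even on a very large file.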


It is correct that Git does not prepend a BOM to the fast-export. The question is, will it make a difference to the Plastic importer? And if the BOM is not present, does Plastic use a default encoding?


The importer could have a default encoding, since git does not write the BOM. Actually, I am not sure that git uses a specific encoding. It could also just write the author names using whatever bytes represent the author string internally in git.

But since git does not write the BOM, I would expect the Plastic importer to use some kind of default encoding and it would be nice to know what that encoding was.


My point is still that, while it may be a good idea to include a BOM, git does not do so, even though the git documentation recommends UTF-8 and git possibly encodes author names with it (I haven't verified that).

Perhaps I could patch my (8 GB+) fast-export with a BOM and Plastic would then read the author names correctly, but it is a hassle, and other users will probably end up with the same problem.
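For what it is worth, patching a file of that size does not require loading it into memory. The Python sketch below writes the BOM and then streams the original file in chunks. Whether the Plastic importer honors, or even tolerates, a leading BOM is not established in this thread, so this is only a way to run the experiment; `prepend_utf8_bom` and the file names are hypothetical.

```python
import codecs
import shutil


def prepend_utf8_bom(src_path: str, dst_path: str,
                     chunk_size: int = 1024 * 1024) -> None:
    """Write a copy of src_path to dst_path with a UTF-8 BOM in front.

    The source is streamed chunk by chunk, so memory use stays small
    regardless of the file size. dst_path must differ from src_path.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(codecs.BOM_UTF8)  # the three bytes EF BB BF
        shutil.copyfileobj(src, dst, chunk_size)
```

Usage would be something like `prepend_utf8_bom("repo.fe", "repo-bom.fe")`, which needs enough free disk space for a second copy of the export.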

If git uses UTF-8 internally, I would suggest that Plastic default to that encoding, BOM or not. If git just stores the author names in whatever encoding the committer used, then it would make more sense to prepend a BOM to the fast-import file. Either way, it would still be nice to know which encoding Plastic uses by default when a BOM is absent.

It should be possible to import a git fast-export into Plastic without patching the fast-export file. This is not the case now. (See my other post with a couple of other issues with the importer).

It is a hard selling point to TFS'ers that if they want to convert to Plastic they should reserve a lot of time and patience before they get a clean import.


Mother of god.

I didn't know that this problem would go so deep.

Why the hell doesn't git include the BOM? Does anyone here actually use git, or used to?

But SoHo has a point when he says:

It is a hard selling point to TFS'ers that if they want to convert to Plastic they should reserve a lot of time and patience before they get a clean import.

Not everybody would easily accept this.


  • 2 weeks later...

I was taking a look at this and indeed, we handle paths well, but we haven't covered the case of special characters in branch names and in the "author" and "committer" tags. I'm going to take care of it.

Assuming you are a Plastic developer, please take a look at the other reported problems (mixed casings in paths, path quotes, etc.)


Hi Soho!

Yes, I'm a member of Plastic development team :)

I have already fixed Cidico's issues with branch and user names, but I would like to be sure that I have also fixed the rest of the reported problems. Could you send me an export file that Plastic does not import correctly? Or at least some cases where you know the import fails.

I would like to get this fixed for everyone before closing the task!

Thanks in advance.


Archived

This topic is now archived and is closed to further replies.
