SVN: merging branches that use different character sets

We have a website that uses ISO-8859-1, and we would like to move it to UTF-8. It is written in PHP, and the conversion itself is easy and well documented.

Because the website runs in several countries, we'd like to try the change in just one of them first. We do this often, and the workflow is simple: we branch the trunk and deploy the branched code to production. To keep the branch up to date, we merge the changes from trunk into the branch until we reintegrate and close the feature branch.

We'd like to test it in just one country in order to reduce the impact if we make a mistake.

For any other kind of change this works very well, but in this case, once the branch has moved to UTF-8, I won't be able to merge trunk changes into the branch to keep it up to date.

I've been trying to find something related to this without success.

Do you know of any way to merge properly between branches that use different charsets?

Thank you very much, Grego

I had the same problem and I solved it in the following way:

Install Python 3 with the chardet package (pip install chardet). Install the diff3 utility; on Windows you can get it from MinGW.
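Before wiring anything into svn, it can help to confirm that both prerequisites are actually reachable. The following is only a minimal sketch: `check_prerequisites` is a hypothetical helper name, not part of chardet or svn, and it only checks that the module is importable and that an executable named diff3 is on PATH.

```python
import importlib.util
import shutil


def check_prerequisites(module_name="chardet", tool_name="diff3"):
    """Report whether a Python module is importable and a tool is on PATH.

    Hypothetical helper for illustration only.
    """
    return {
        # find_spec returns None when the top-level module cannot be found
        module_name: importlib.util.find_spec(module_name) is not None,
        # shutil.which returns None when no matching executable is on PATH
        tool_name: shutil.which(tool_name) is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If diff3 is reported as missing on Windows, the MinGW directory that contains it probably still needs to be added to PATH.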

Edit the svn config file (on Windows, %APPDATA%\Subversion\config):

[helpers]
# note \\ instead of \
diff3-cmd = C:\\diff3wrap.bat

where C:\diff3wrap.bat contains:

@echo off
SETLOCAL ENABLEEXTENSIONS
set pythondir=path\to\python3\dir
set mingwdir=path\to\diff3\dir

set pythonpath=%pythondir%\lib\site-packages;%pythonpath%
set path=%pythondir%;%mingwdir%;%path%
rem svn passes diff3-cmd the same arguments that the diff3 utility expects
rem e.g. -E -m -L .working -L .merge-left.r5 -L .merge-right.r6 path\to\temp\local\file path\to\temp\base\file path\to\temp\remote\file
python C:\diff3.py %*
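
The argument layout described in the rem lines above can be illustrated with a short Python sketch. `split_diff3_args` is a hypothetical helper written for this explanation; it assumes the three file paths always come last and that every -L is followed by its label, as in the example arguments.

```python
def split_diff3_args(args):
    """Split diff3-style arguments into (labels, (local, base, remote)).

    Hypothetical helper: assumes the last three arguments are the
    local, base and remote temp files, in that order.
    """
    options = args[:-3]
    # each -L option is followed by the label for one of the three files
    labels = [options[i + 1] for i, arg in enumerate(options)
              if arg == "-L" and i + 1 < len(options)]
    local_path, base_path, remote_path = args[-3:]
    return labels, (local_path, base_path, remote_path)


if __name__ == "__main__":
    example = ["-E", "-m",
               "-L", ".working",
               "-L", ".merge-left.r5",
               "-L", ".merge-right.r6",
               "local_path", "base_path", "remote_path"]
    print(split_diff3_args(example))
```

The wrapper only needs the three trailing paths; all other arguments are passed through to diff3 unchanged.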

and C:\diff3.py contains:

#!python3
import codecs
import sys
from subprocess import Popen, PIPE

from chardet.langcyrillicmodel import Ibm866Model, Win1251CyrillicModel
from chardet.sbcharsetprober import SingleByteCharSetProber
from chardet.universaldetector import UniversalDetector
from chardet.utf8prober import UTF8Prober

detector = UniversalDetector()
# keep only the probers you need, to speed up encoding detection
detector._mCharSetProbers = [  # in newer chardet versions use _charset_probers
    UTF8Prober(),
    SingleByteCharSetProber(Ibm866Model),
    SingleByteCharSetProber(Win1251CyrillicModel)]


def detect_encoding(file_path):
    detector.reset()
    # open in a with-block so the file handle is always closed
    with open(file_path, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    encoding = detector.result["encoding"]
    # treat ascii files as utf-8
    return 'utf-8' if encoding == 'ascii' else encoding


def iconv(file_path, from_encoding, to_encoding):
    if from_encoding == to_encoding:
        return
    with codecs.open(file_path, 'r', from_encoding) as i:
        text = i.read()
    write_to_file(file_path, text, to_encoding)


def write_to_file(file_path, text, to_encoding):
    with open(file_path, 'wb') as o:
        write_bytes_to_stream(o, text, to_encoding)


def write_bytes_to_stream(stream, text, to_encoding):
    # if you want a BOM in your files, you have to add it by hand
    if to_encoding == "UTF-16LE":
        stream.write(codecs.BOM_UTF16_LE)
    elif to_encoding == "UTF-16BE":
        stream.write(codecs.BOM_UTF16_BE)
    stream.write(text.encode(to_encoding, 'ignore'))


def main():
    # when you press the 'merge' button in the TortoiseSVN commit dialog,
    # it adds some arguments that the diff3 tool doesn't understand
    for f in ['--ignore-eol-style', '-w']:
        if f in sys.argv:
            sys.argv.remove(f)

    # ['diff3.py', '-E', '-m', '-L', '.working', '-L', '.merge-left.r5',  '-L', '.merge-right.r6',
    # 'local_path', 'base_path', 'remote_path']
    local_path = sys.argv[-3]
    local_encoding = detect_encoding(local_path)

    base_path = sys.argv[-2]
    base_encoding = detect_encoding(base_path)

    remote_path = sys.argv[-1]
    remote_encoding = detect_encoding(remote_path)

    # diff3 doesn't work with UTF-16, so all files have to be converted to UTF-8
    aux_encoding = 'utf-8'
    iconv(local_path, local_encoding, aux_encoding)
    iconv(base_path, base_encoding, aux_encoding)
    iconv(remote_path, remote_encoding, aux_encoding)

    sys.argv[0] = 'diff3'
    p = Popen(sys.argv, stdout=PIPE, stderr=sys.stderr)
    stdout = p.communicate()[0]
    result_text = stdout.decode(aux_encoding)
    write_bytes_to_stream(sys.stdout.buffer, result_text, local_encoding)

    # on a conflict, svn copies the temp base file and the temp remote file next to
    # your file in the working copy, with names like
    # your_file_with_conflict.merge-left.r5 and your_file_with_conflict.merge-right.r6
    # if you resolve the conflict with a merge tool, it will use these files
    # if these files and the file in your working copy have different encodings,
    # your working file changes encoding after conflict resolution, which is bad
    # that's why the temp files have to be converted back to the local file's encoding
    iconv(base_path, aux_encoding, local_encoding)
    iconv(remote_path, aux_encoding, local_encoding)
    sys.exit(p.returncode)


if __name__ == '__main__':
    main()
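
The probers in the script above are set up for UTF-8 and two Cyrillic code pages. For the case in the question, where files are either ISO-8859-1 or UTF-8, detection could perhaps be done without chardet at all. This is a minimal sketch of that swapped-in approach, not the method used in the script: `detect_utf8_or_latin1` and `to_utf8` are hypothetical names, and the sketch assumes that no other encodings occur in the repository.

```python
def detect_utf8_or_latin1(data: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1 (assumes no other encodings occur)."""
    try:
        # strict decoding fails on byte sequences that are not valid UTF-8
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so it never fails
        return "iso-8859-1"


def to_utf8(data: bytes) -> bytes:
    """Re-encode the bytes as UTF-8, whichever of the two encodings they use."""
    return data.decode(detect_utf8_or_latin1(data)).encode("utf-8")


if __name__ == "__main__":
    latin1_bytes = "caf\u00e9".encode("iso-8859-1")
    print(detect_utf8_or_latin1(latin1_bytes))
    print(to_utf8(latin1_bytes))
```

Pure ASCII content is reported as UTF-8, which matches how the script treats ASCII files. One limitation: ISO-8859-1 text whose accented bytes happen to form valid UTF-8 sequences would be misdetected, although that is unlikely in ordinary text.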