View Issue Details Jump to Notes ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0015326CMakeCMakepublic2014-12-24 12:482015-06-01 08:38
ReporterKwabena W. Agyeman 
Assigned ToClinton Stimpson 
PrioritynormalSeverityfeatureReproducibilityalways
StatusclosedResolutionfixed 
PlatformOSOS Version
Product VersionCMake 3.1 
Target VersionFixed in Version 
Summary0015326: CMake can't read UTF-16 source files
DescriptionI'm trying to do dependency parsing in cmake for parallax propeller spin files. This involves opening up *.spin code files and parsing them using the file(STRINGS ... REGEX ...) command. Most spin files are saved as ASCII/UTF-8 files and there are no problems parsing these files. However, some spin files are saved in UTF-16 format which allows the use of special schematic characters in the files. Using the file(STRINGS ... REGEX ...) command does not work on UTF-16 files as cmake is unable to parse them.
Steps To ReproduceThe below code should find the "DAT" string in the attached file...

file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$")
Additional InformationSaving the problem UTF-16 files as UTF-8 files fixes the issue. But, I can't just convert the source files. There's a lot of legacy spin code I need to support that uses the UTF-16 format for the special schematic characters.
TagsNo tags attached.
Attached Files? file icon Quadrature Encoder.spin [^] (32,690 bytes) 2014-12-24 12:48
patch file icon 0001-add-initial-file-STRINGS-utf16-support.patch [^] (2,882 bytes) 2015-01-03 14:48 [Show Content]
patch file icon 0001-file-STRINGS-UTF-16-and-UTF-32-convertible-to-ASCII.patch [^] (3,649 bytes) 2015-01-11 15:49 [Show Content]
patch file icon 0001-Add-test-for-file-STRINGS-with-UTF-16-and-UTF-32-enc.patch [^] (3,461 bytes) 2015-01-14 00:36 [Show Content]
patch file icon 0001-Compare-file-STRINGS-encodings.patch [^] (4,045 bytes) 2015-01-16 19:54 [Show Content]
patch file icon 0002-try2-file-STRINGS-to-strings-comparison.patch [^] (1,493 bytes) 2015-01-16 20:05 [Show Content]
patch file icon 0002-try3-file-STRINGS-to-strings-comparison.patch [^] (1,643 bytes) 2015-01-16 20:13 [Show Content]
patch file icon 0001-StringFileTest-read-files-using-same-encoding.-Add-t.patch [^] (2,508 bytes) 2015-01-23 21:00 [Show Content]

 Relationships

  Notes
(0037527)
Clinton Stimpson (developer)
2014-12-24 13:21

UTF-8 support was added recently, and UTF-16 could be added as well with a patch to cmFileCommand.cxx.

As a workaround, you can probably add a cmake command to copy the UTF-16 files to your build directory and convert them to UTF-8 to run file(STRINGS...), but still use the original UTF-16 elsewhere.
(0037528)
Kwabena W. Agyeman (reporter)
2014-12-24 15:07
edited on: 2014-12-24 15:08

I will just wait until UTF-16 support is added. I hope this bug gets pulled into the next release since it shouldn't be too much trouble.

I can't depend on any external commands being available for a workaround.

(0037547)
Justin Borodinsky (reporter)
2015-01-03 14:52

I uploaded a patch to master, tested with your file, with the commands below. Currently only tested on Windows.

cmake_minimum_required(VERSION 3.0)
include(CMakePrintHelpers)
project(cmake_test)

file(STRINGS "Quadrature Encoder_BE.spin" OUT_BE REGEX "^[\t ]*[Dd][Aa][Tt].*$" ENCODING UTF-16)
file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$" ENCODING UTF-16)

file(STRINGS "CMakeLists.txt" OUT2 REGEX "project")

cmake_print_variables(OUT_BE OUT OUT2)
(0037548)
Clinton Stimpson (developer)
2015-01-03 23:35

Thanks for providing a proposed patch.

However, I see some issues with the patch:

1. it doesn't work on non-Windows platforms where wchar_t is 32 bits and is commonly used for UTF-32. Encoding::ToNarrow() does a UTF-16 to narrow conversion on Windows and a UTF-32 to narrow conversion on other platforms. Encoding::ToNarrow() cannot be used unconditionally on all platforms to handle UTF-16.

2. I see no handling for files without a BOM. It should handle both binary and text files, with and without BOM. For example, on Linux, I can compile a binary, or make a binary file, which contains any combination of UTF-8, UTF-16LE, UTF-16BE, UTF-32BE, UTF-32LE, and run the command "strings -e b file" to extract only the UTF-16BE strings.
For example, see https://sourceware.org/binutils/docs/binutils/strings.html [^] for how encoding options are specified. I think it would be good to have CMake behave the same as GNU's string tool.

3. There are no checks for invalid UTF-16 sequences. This is especially important to distinguish UTF-16 embedded in a binary file.

4. It fails on UTF-16 surrogate pairs. Instead of reading 2 bytes, one needs to read 4 bytes from the file before converting to UTF-8.
(0037551)
Kwabena W. Agyeman (reporter)
2015-01-04 10:44

Thanks for working on this! :)
(0037634)
Kwabena W. Agyeman (reporter)
2015-01-08 12:18

@Justin Borodinsky

It would preferable if CMake could autodetect UTF-16 files. They all have a BOM of 0xFE followed by 0xFF. So, it's easy to detect them out from regular text files. I assume CMake can already handle the UTF-8 BOM. http://en.wikipedia.org/wiki/Byte_order_mark [^]
(0037658)
Justin Borodinsky (reporter)
2015-01-09 23:38

@Clinton

Thanks for the feedback, a few questions:
1 - I can use char16_t and facets to handle the UTF-16: http://en.cppreference.com/w/cpp/locale/codecvt [^] Let me know if you have any thoughts on using this approach.
2 - strings extracts printable characters using the ASCII code page only (encoded in various forms as you mentioned). It appears from the UTF-8 extraction currently in place that the intent is not to limit files to these codes only. See the STRING_ISGRAPHIC macro in strings.c.
3 - strings distinguishes printable characters by looking for a sequence of printable characters. Invalid UTF-16 is simply skipped. Should CMake do the same?

@Kwabena
Using the BOM to identify the whole file as a certain encoding would eliminate the need to identify strings from a binary file
(0037660)
Clinton Stimpson (developer)
2015-01-10 16:00

Justin,

char16_t is only available with C++11. CMake still needs to be compatible with older compilers, so I don't think that is an option.

Its interesting that GNU strings will extract only UTF-16 that can be converted to ASCII (even some valid UTF-16 such as "Ūnĭcōde̽" is ignored). I guess the same for CMake is fine. I previously thought it did more than that.

It seems those limited requirements will simplify the implementation.
(0037667)
Justin Borodinsky (reporter)
2015-01-11 15:48

Hi Clinton,

Yes, the implementation is simple now. Attached is a new proposed patch, used as follows:

cmake_minimum_required(VERSION 3.0)
include(CMakePrintHelpers)
project(cmake_test)

file(STRINGS "Quadrature Encoder_BE.spin" OUT_BE REGEX "^[\t ]*[Dd][Aa][Tt].*$")
file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$")
file(STRINGS "CMakeLists.txt" OUT2 REGEX "project")

#Output should match strings | grep error
file(STRINGS "/usr/lib64/libm.so" OUT_BIN REGEX "error")

#Output should match strings -e L, default strings length is 4
file(STRINGS "/usr/lib64/libm.so" OUT_BIN_UTF32LE REGEX "...." ENCODING UTF-32LE)

cmake_print_variables(OUT_BE OUT OUT2 OUT_BIN OUT_BIN_UTF32LE)

Let me know what you think. Thanks.
(0037681)
Clinton Stimpson (developer)
2015-01-14 00:43

Hi Justin,

I've attached a patch which adds a test to CMake to cover the new functionality in your patch.
The test currently fails and I think your patch needs some fixing.

For example, I'm getting strings back out like this:
"Hello World;; ;;"
or
"H;e;l;l;o;
  ;W;o;r;l;d;;; "

The test files have 4 lines to them (including a BOM):
1. Hello World
2. Hello World in chinese
3. Hello World in Russian
4. Hello World with a corruption

I think there is a second problem where "strings -e {l,L,b,B}" all ignore the 4th line, but I think your patch is sometimes picking up the 4th line.
(0037727)
Justin Borodinsky (reporter)
2015-01-16 19:54

Hi Clinton,

Thanks for the test case, I should of put one together with the patch.

To compare the output to strings you would need to add REGEX "...." since strings by default only outputs a sequence of at least 4 chars. Attached is a slightly modified version of your test, to compare the output to strings. For the 16 calls made the outputs match between StringFileTest and the runstrings.sh script.
(0037728)
Justin Borodinsky (reporter)
2015-01-16 20:05

Apologies, didn't mean to step on your patch with the minor mods, attached is a patch to your patch in reference to the note above
(0037766)
Clinton Stimpson (developer)
2015-01-21 21:45

Justin,

I hope I have the right set of patches together, and when I run the test, I get a failure.
Are you seeing this?

...
28: UTF-32LE.txt as UTF-32LE: Hello World
28: UTF-32LE.txt as UTF-32BE:
28: CMake Error at CMakeLists.txt:78 (message):
28: file(STRINGS) incorrectly read from UTF-32LE.txt file []
28:
28:
28: UTF-32BE.txt as UTF-16LE:
28: UTF-32BE.txt as UTF-16BE:

I have my set of patches applied here:
https://github.com/clintonstimpson/CMake/tree/file-strings-utf-16 [^]
(0037801)
Justin Borodinsky (reporter)
2015-01-23 20:59

Hi Clinton,

Thanks, yes those were the right set of patches. I am seeing the same failure, but I suppose my question is why does each file need to be read using every other encoding? For example, the failure is when it's reading the UTF-32LE file as UTF-32BE, and it's expecting to match "Hello World", but receives "". The output of 'strings' when reading the UTF-32LE file as UTF-32BE also produces no output.

Attached is another patch to your file-strings-utf16 branch which just simplifies the tests a bit (and also removes the temporary shell script). Is this set of tests sufficient, or was there a reason for the two inner loops I'm missing?
(0037812)
Clinton Stimpson (developer)
2015-01-26 10:40

I was just trying to be thorough in testing, but if you don't think its really necessary, that is fine.

Thanks for your work on this. I've merged it for you.
http://www.cmake.org/gitweb?p=cmake.git;a=commit;h=d4506d17 [^]
(0037828)
Justin Borodinsky (reporter)
2015-01-26 20:46

Great, thank you for your help.
(0037840)
Brad King (manager)
2015-01-27 11:37

The revised commit implementing this has now been merged to 'master':

 file: Teach STRINGS to support UTF-16 and UTF-32 encodings
 http://cmake.org/gitweb?p=cmake.git;a=commitdiff;h=1f77a700 [^]

Thanks!
(0038862)
Robert Maynard (manager)
2015-06-01 08:38

Closing resolved issues that have not been updated in more than 4 months.

 Issue History
Date Modified Username Field Change
2014-12-24 12:48 Kwabena W. Agyeman New Issue
2014-12-24 12:48 Kwabena W. Agyeman File Added: Quadrature Encoder.spin
2014-12-24 13:21 Clinton Stimpson Note Added: 0037527
2014-12-24 15:07 Kwabena W. Agyeman Note Added: 0037528
2014-12-24 15:08 Kwabena W. Agyeman Note Edited: 0037528
2015-01-03 14:48 Justin Borodinsky File Added: 0001-add-initial-file-STRINGS-utf16-support.patch
2015-01-03 14:52 Justin Borodinsky Note Added: 0037547
2015-01-03 23:35 Clinton Stimpson Note Added: 0037548
2015-01-04 10:44 Kwabena W. Agyeman Note Added: 0037551
2015-01-08 12:18 Kwabena W. Agyeman Note Added: 0037634
2015-01-09 23:38 Justin Borodinsky Note Added: 0037658
2015-01-10 16:00 Clinton Stimpson Note Added: 0037660
2015-01-11 15:48 Justin Borodinsky Note Added: 0037667
2015-01-11 15:49 Justin Borodinsky File Added: 0001-file-STRINGS-UTF-16-and-UTF-32-convertible-to-ASCII.patch
2015-01-14 00:36 Clinton Stimpson File Added: 0001-Add-test-for-file-STRINGS-with-UTF-16-and-UTF-32-enc.patch
2015-01-14 00:43 Clinton Stimpson Note Added: 0037681
2015-01-16 19:54 Justin Borodinsky Note Added: 0037727
2015-01-16 19:54 Justin Borodinsky File Added: 0001-Compare-file-STRINGS-encodings.patch
2015-01-16 20:05 Justin Borodinsky Note Added: 0037728
2015-01-16 20:05 Justin Borodinsky File Added: 0002-try2-file-STRINGS-to-strings-comparison.patch
2015-01-16 20:13 Justin Borodinsky File Added: 0002-try3-file-STRINGS-to-strings-comparison.patch
2015-01-21 21:45 Clinton Stimpson Note Added: 0037766
2015-01-23 20:59 Justin Borodinsky Note Added: 0037801
2015-01-23 21:00 Justin Borodinsky File Added: 0001-StringFileTest-read-files-using-same-encoding.-Add-t.patch
2015-01-26 10:40 Clinton Stimpson Note Added: 0037812
2015-01-26 10:40 Clinton Stimpson Status new => resolved
2015-01-26 10:40 Clinton Stimpson Resolution open => fixed
2015-01-26 10:40 Clinton Stimpson Assigned To => Clinton Stimpson
2015-01-26 20:46 Justin Borodinsky Note Added: 0037828
2015-01-27 11:37 Brad King Note Added: 0037840
2015-06-01 08:38 Robert Maynard Note Added: 0038862
2015-06-01 08:38 Robert Maynard Status resolved => closed


Copyright © 2000 - 2018 MantisBT Team