-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities #48943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
cpp/src/arrow/json/test_common.h
Outdated
| // Using c_str() is safe here because generation excludes U+0000 (no embedded nulls). | ||
| // U+0000 can only exist in plane 0 (BMP), and BMP generation starts at U+0020. | ||
| return OK(writer.String(s.c_str())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just call writer.String(s) actually.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a particular concern around this that led you to submit this PR?
In any case, I think it might be worth having a more generic helper in arrow/testing/random.h or anything. Something like:
std::string RandomUtf8String(int num_chars);|
There was a todo Let me take a look at the suggestion. I am also happy with just removing this TODO out if that doesn't sound quite worthwhile .. |
| std::random_device rd; | ||
| std::default_random_engine gen(rd()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the seeding deterministic? Perhaps pass an explicit seed to this function. We want tests to be reproducible reliably.
(also, we're using pcg32 in other random generation functions, perhaps use it here too?)
|
|
||
| for (int i = 0; i < num_chars; ++i) { | ||
| uint32_t codepoint; | ||
| std::uniform_int_distribution<uint32_t> plane_dist(0, 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how expensive it is to instantiate all those distributions at each loop iteration. Perhaps they can be moved out of the loop.
Rationale for this change
The JSON test utility
GenerateAsciiwas only generating ASCII characters. Should better have the test coverage for proper UTF-8 and Unicode handling.What changes are included in this PR?
Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as an util.
Are these changes tested?
There are existent tests for JSON.
Are there any user-facing changes?
No, test-only.