Member

@HyukjinKwon HyukjinKwon commented Jan 22, 2026

Rationale for this change

The JSON test utility GenerateAscii only generated ASCII characters. The tests should also cover proper UTF-8 and Unicode handling.

What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.
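
For readers unfamiliar with the RFC 3629 byte layout mentioned above, here is a minimal sketch of encoding a single Unicode scalar value as UTF-8. The helper name `AppendUtf8` is hypothetical and is not taken from this PR; it only illustrates the 1- to 4-byte forms.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not the PR's code): append one Unicode scalar value
// to `out` using the UTF-8 byte layout defined in RFC 3629.
void AppendUtf8(uint32_t cp, std::string* out) {
  // Scalar values stop at U+10FFFF and exclude the surrogate range.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx
    out->push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}
```

Calling `AppendUtf8(0x20AC, &s)` on an empty string, for example, leaves the three bytes `E2 82 AC` in `s`.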

Are these changes tested?

Yes, by the existing JSON tests.

Are there any user-facing changes?

No, test-only.

@github-actions
⚠️ GitHub issue #48941 has been automatically assigned in GitHub to PR creator.

Comment on lines 181 to 183
// Using c_str() is safe here because generation excludes U+0000 (no embedded nulls).
// U+0000 can only exist in plane 0 (BMP), and BMP generation starts at U+0020.
return OK(writer.String(s.c_str()));
Member

I think we can just call writer.String(s) actually.

@github-actions (bot) added the "awaiting committer review" label and removed the "awaiting review" label on Jan 26, 2026
Member

@pitrou pitrou left a comment

Was there a particular concern around this that led you to submit this PR?

In any case, I think it might be worth having a more generic helper in arrow/testing/random.h or somewhere similar. Something like:

std::string RandomUtf8String(int num_chars);

@HyukjinKwon
Member Author

There was a TODO, // FIXME generate UTF8, and I am trying to clear those TODOs. Some of them are not really actionable or are overkill; I plan to sweep those away once I have handled all the actionable ones.

Let me take a look at the suggestion. I am also happy to just remove this TODO if the change doesn't seem worthwhile.

Comment on lines +1479 to +1480
std::random_device rd;
std::default_random_engine gen(rd());
Member

Can we make the seeding deterministic? Perhaps pass an explicit seed to this function. We want tests to be reproducible reliably.

(also, we're using pcg32 in other random generation functions, perhaps use it here too?)
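
A minimal sketch of the explicit-seed idea, assuming a hypothetical helper named `SeededCodepoints`. It uses `std::mt19937_64` only as a standard-library stand-in for pcg32 so that the example is self-contained, and the printable-ASCII range is chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: draw `n` code points from an engine seeded by the
// caller, so the same seed reproduces the same sequence within a given build.
std::vector<uint32_t> SeededCodepoints(uint64_t seed, int n) {
  assert(n >= 0);
  // Stand-in engine; the review suggests pcg32, which is not in the standard library.
  std::mt19937_64 gen(seed);
  // Illustrative range only: printable ASCII, U+0020 through U+007E.
  std::uniform_int_distribution<uint32_t> dist(0x20, 0x7E);
  std::vector<uint32_t> out;
  out.reserve(static_cast<std::size_t>(n));
  for (int i = 0; i < n; ++i) {
    out.push_back(dist(gen));
  }
  return out;
}
```

Two calls with the same seed return identical vectors, which is what makes a failing test repeatable. The exact values may differ between standard library implementations, since the output of `std::uniform_int_distribution` is not specified bit for bit.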


for (int i = 0; i < num_chars; ++i) {
  uint32_t codepoint;
  std::uniform_int_distribution<uint32_t> plane_dist(0, 3);
Member

I don't know how expensive it is to instantiate all those distributions at each loop iteration. Perhaps they can be moved out of the loop.
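
A rough sketch of what hoisting the distributions could look like. The function name `HoistedCodepoints`, the engine, and the code point ranges per bucket are all assumptions made for illustration; only `plane_dist(0, 3)` comes from the quoted diff.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch: every distribution is constructed once, before the
// loop, and only invoked inside it.
std::vector<uint32_t> HoistedCodepoints(uint64_t seed, int num_chars) {
  assert(num_chars >= 0);
  std::mt19937_64 gen(seed);  // stand-in engine, seeded explicitly
  std::uniform_int_distribution<uint32_t> plane_dist(0, 3);
  // Assumed ranges, not the PR's actual ones. The surrogate block
  // U+D800..U+DFFF is skipped because surrogates are not scalar values.
  std::uniform_int_distribution<uint32_t> bmp_low(0x0020, 0xD7FF);
  std::uniform_int_distribution<uint32_t> bmp_high(0xE000, 0xFFFF);
  std::uniform_int_distribution<uint32_t> smp(0x10000, 0x1FFFF);
  std::uniform_int_distribution<uint32_t> upper(0x20000, 0x10FFFF);

  std::vector<uint32_t> out;
  out.reserve(static_cast<std::size_t>(num_chars));
  for (int i = 0; i < num_chars; ++i) {
    switch (plane_dist(gen)) {
      case 0: out.push_back(bmp_low(gen)); break;
      case 1: out.push_back(bmp_high(gen)); break;
      case 2: out.push_back(smp(gen)); break;
      default: out.push_back(upper(gen)); break;
    }
  }
  return out;
}
```

Constructing a `std::uniform_int_distribution` is usually cheap, so the gain may be small, but building them once also keeps the loop body focused on the sampling itself.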
