As a Turkish speaker who was using a Turkish-locale setup in my teenage years, these kinds of bugs frustrated me infinitely. Half of the Java or Python apps I installed never ran. My PHP webservers always had problems with random software. Ultimately, I had to change my system's language to English. However, the US has godawful standards for everything: dates, measurement units, paper sizes.
When I shared computers with my parents I had to switch languages back and forth all the time. This helped me learn English rather quickly, but I find it a huge accessibility and software design issue.
If your program depends on letter cases, that is a badly designed program, period. If a language ships a toUpper or toLower function without a mandatory language argument, it is badly designed too. The only slightly-better option is making toUpper and toLower ASCII-only and throwing an error for any other character set.
While half of the language design of C is questionable and outright dangerous, all popular OSes making its functions locale-sensitive was an avoidable mistake. Yet everybody did it. Just the existence of this behavior is a reason I would like to get rid of anything GNU-based in the systems I develop today.
I don't care if Unicode releases a conversion map. Natural-language behavior should always require natural language metadata too. Even modern languages like Rust did a crappy job of enforcing it: https://doc.rust-lang.org/std/primitive.char.html#method.to_... . Yes it is significantly safer but converting 'ß' to 'SS' in German definitely has gotchas too.
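(To make the failure mode concrete: a minimal Java sketch of the divergence, since the bug in TFA is a Java one. The tag string is an arbitrary example.)

```java
import java.util.Locale;

public class DotlessDemo {
    public static void main(String[] args) {
        String tag = "INFO";
        // With a Turkish locale, LATIN CAPITAL LETTER I lowercases to
        // dotless 'ı' (U+0131), so lookups against "info" fail.
        System.out.println(tag.toLowerCase(Locale.forLanguageTag("tr"))); // "ınfo"
        // Locale-independent casing: 'I' always maps to 'i'.
        System.out.println(tag.toLowerCase(Locale.ROOT)); // "info"
    }
}
```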
> However, the US has godawful standards for everything: dates, measurement units, paper sizes.
Isn't the choice of language and date and unit formats normally independent?
There are OS-level settings for date and unit formats but not all software obeys that, instead falling back to using the default date/unit formats for the selected locale.
They’re about as independent as system language defaults causing software not to work properly. It’s that whole realm of “well we assumed that…” design error.
> > However, the US has godawful standards for everything: dates, measurement units, paper sizes.
> Isn't the choice of language and date and unit formats normally independent?
You would hope so, but no. Quite a bit of software ties the language setting to the locale setting. If you are lucky, they will provide an "English (UK)" option (which still uses miles or stones. FFS WTF is a stone!).
On Windows you can kinda select the units easily. On Linux, let me introduce you to the journey of the LC_* environment variables: https://www.baeldung.com/linux/locale-environment-variables . This doesn't mean the websites or the apps will obey them. Quite a few of them don't and just use LANGUAGE, LANG or LC_CTYPE as their setting.
My company switched to Notion this year (I still miss Confluence). It was hell until last month, since they only had "English (US)" and used M/D/Y everywhere with no option to change it!
> FFS WTF is a stone
An English imperial unit. The measurement was originally based on actual stones and was mainly used for weighing agricultural items such as animal meat and potatoes. We also used tons and pounds before we adopted the metric system from Europe.
A stone is 1/8th of a long hundredweight. Easy!
> While half of the language design of C is questionable and outright dangerous, all popular OSes making its functions locale-sensitive was an avoidable mistake. Yet everybody did it. Just the existence of this behavior is a reason I would like to get rid of anything GNU-based in the systems I develop today.
POSIX requires that many functions account for the current locale. I'm not sure why you are blaming GNU for this.
If it's offered, choose EN-Australian or EN-international. Then you get sensible dates and measurement units.
And if you want it to be more sensible but still not sensible, pick EN-ca.
> While half of the language design of C is questionable and outright dangerous, all popular OSes making its functions locale-sensitive was an avoidable mistake.
It wasn’t a mistake for local software that is supposed to automatically use the user’s locale. It’s what made a lot of local software usefully locale-sensitive without the developer having to put much effort into it, or even necessarily be aware of it. It’s the reason why setting the LC_* environment variables on Linux has any effect on most software.
The age of server software, and software talking to other systems, is what made that default less convenient.
use Australian English: English, but with sane settings for everything else, including keyboard layout
I live in Germany now, so I generally set it to Irish nowadays. Since I like the ISO-style Enter key, I use the UK keyboard layout (also easier to switch to Turkish than an ANSI layout). However, many OSes now have an English (Europe) locale too.
Many Linux distributions provide en_DK specifically for this purpose. English as it is used in Denmark. :-)
Denmark doesn't have the euro as its currency, unfortunately.
Tying currency to locale seems insane. I have bank accounts in multiple currencies and use more than one of them several times per week. Why does all software on my system need to have a default currency? Most software does not care about money; those that do usually give you a quote in a currency fixed by someone else.
I thought locale is mostly controlled by the environment, so you can run your system and each program with its own separate locale settings if you like.
> If a language ships a toUpper or toLower function without a mandatory language argument, it is badly designed too. The only slightly-better option is making toUpper and toLower ASCII-only and throwing an error for any other character set.
There is a deeper bug within Unicode.
The Turkish letter TURKISH CAPITAL LETTER DOTLESS I is represented as the code point U+0049, which is named LATIN CAPITAL LETTER I.
The Greek letter GREEK CAPITAL LETTER IOTA is represented as the code point U+0399, named... GREEK CAPITAL LETTER IOTA.
The relationship between the Greek letter I and the Roman letter I is identical in every way to the relationship between the Turkish letter dotless I and the Roman letter I. (Heck, the lowercase form is also dotless.) But lowercasing works on GREEK CAPITAL LETTER IOTA because it has a code point to call its own.
Should iota have its own code point? The answer to that question is "no": it is, by definition, drawn identically to the ASCII I. But Unicode has never followed its principles. This crops up again and again and again, everywhere you look. (And, in "defense" of Unicode, it has several principles that directly contradict each other.)
Then people come to rely on behavior that only applies to certain buggy parts of Unicode, and get messed up by parts that don't share those particular bugs.
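(A small Java sketch of the asymmetry, following the Unicode case mappings described above; both calls use the Turkish locale:)

```java
import java.util.Locale;

public class IotaDemo {
    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // GREEK CAPITAL LETTER IOTA has its own code point, so its lowercase
        // mapping is unconditional, even under a Turkish locale.
        System.out.println("\u0399".toLowerCase(turkish)); // "ι" (U+03B9)
        // LATIN CAPITAL LETTER I is shared with Turkish, so its lowercase
        // mapping is locale-dependent.
        System.out.println("I".toLowerCase(turkish));      // "ı" (U+0131)
    }
}
```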
It’s not a bug, it’s a feature. The reason is that ISO 8859-7 [0], used for Greek, has a separate character code for iota (for all Greek letters, really), while ISO 8859-3 [1] and -9 [2], used for Turkish, have no separate code for the usual dotless uppercase I.
One important goal of Unicode is to be able to convert from existing character sets to Unicode (and back) without having to know the language of the text that is being converted. If they had invented a separate code point for I in Turkish, then when converting text from those existing ISO character encodings, you’d have to know whether the text is Turkish or English or something else, to know which Unicode code point to map the source “I” into. That’s exactly what Unicode was designed to avoid.
[0] https://en.wikipedia.org/wiki/ISO/IEC_8859-7
[1] https://en.wikipedia.org/wiki/ISO/IEC_8859-3
[2] https://en.wikipedia.org/wiki/ISO/IEC_8859-9
I know that. That's why I mentioned
> in "defense" of Unicode, it has several principles that directly contradict each other
Unicode wants to do several things, and they aren't mutually compatible. It is premised on the idea that you can be all things to all people.
> It’s not a bug, it’s a feature.
It is a bug. It directly violates Unicode's stated principles. It's also a feature, but that won't make it not a bug.
When I saw "Turkish alphabet bug", I just knew it was some version of toLower() gone horribly wrong.
(I'm sure there's a good reason, but I find it odd that compiler message tags are invariably uppercase, yet in this problem code they lowercased the tag to do a lookup from an enum of lowercase names. Why isn't the enum uppercase, like the things you're going to look up?)
With Turkish you can't safely case-fold with toupper() or tolower() in a C/US locale: i->I and I->i are both wrong. Uppercasing wouldn't work. You have to use Unicode or Latin-5 to manage it.
> Why isn't the enum uppercase, like the things you're going to lookup?
Another question: why does the log record the string you intended to look up, instead of the string you actually did look up?
I am one of the maintainers of the Scala compiler, and this is one of the things that immediately jumps out at me when I review code that contains any casing operation. Always explicitly specify the locale. However, unlike TFA and other comments, I don't suggest `Locale.US`. That's a little too US-centric. The canonical locale is in fact `Locale.ROOT`. Granted, in practice it's equivalent, but I find it a little bit more sensible.
Also, this is the last remaining major system-dependent default in Java. They made strict floating point the default in 17, and UTF-8 the default encoding in 18 (JEP 400); only the locale remains. I hope they make ROOT the default in an upcoming version.
FWIW, in the Scala.js implementation, we've been using UTF-8 and ROOT as the defaults forever.
> However, unlike TFA and other comments, I don't suggest `Locale.US`. That's a little too US-centric. The canonical locale is in fact `Locale.ROOT`. Granted, in practice it's equivalent, but I find it a little bit more sensible.
I have no idea what `Locale.ROOT` refers to, and I'd be worried that it's accidentally the same as the system locale or something, exactly the sort of thing that will unexpectedly change when a Turkish-speaker uses a computer or what have you.
> I'd be worried that it's accidentally the same as the system locale or something
The API docs clearly specify that Locale.ROOT “is regarded as the base locale of all locales, and is used as the language/country neutral locale for the locale sensitive operations.”
> However, unlike TFA and other comments, I don't suggest `Locale.US`. That's a little too US-centric. The canonical locale is in fact `Locale.ROOT`. Granted, in practice it's equivalent, but I find it a little bit more sensible.
Isn't it kind of strange to say that Locale.US is too US-centric, and therefore we'll invent a new, fictitious locale, the contents of which are all the US defaults, but which we'll call "the base locale of all locales"? That somehow seems even more US-centric to me than just saying Locale.US.
Setting the locale as Locale.US is at least comprehensible at a glance.
I guess it's one way to look at it. I see it as: I want a reproducible locale, independent of the user's system. If I see US, I'm wondering if it was chosen to be English because the program was written in English. When I localize the program, should I make that locale configurable? ROOT communicates that it must not be configurable, and never dependent on the system.
It is a programming-language-agnostic equivalent of the POSIX C locale, with Unicode enhancements.
Ugh, I've had the exact same problem in a Java project, which meant I had to go through thousands and thousands of lines of code and make sure that every 'toLowerCase()' on enum names included Locale.ENGLISH as a parameter.
As the article demonstrates, the error manifests in a completely inscrutable way. But once I saw the bug from a couple of users with Turkish sounding names, I zeroed in on it. And cursed a few times under my breath whoever messed up that character table so bad.
Were you not using static analysis tools? All of the popular ones will warn about that issue with locales.
In C#, you are able to specify a culture every time you call a function that converts between numbers and strings or performs case conversion. Or you specify the "Invariant Culture", which is basically US English. But the default culture is still based on your system's locale, so you need to explicitly name the invariant culture everywhere. Because it involves filling in parameters for many different functions, people often leave it out, and then their code breaks on systems where "," is the decimal separator.
You can also change the default culture to the invariant culture and save all the headaches. Save the localized number conversion and such for situations where you actually need to interact with localized values.
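(The Java analogue of that global switch, as a sketch; it assumes you control process startup, since the default is process-wide:)

```java
import java.util.Locale;

public class PinLocale {
    public static void main(String[] args) {
        // Pin the process-wide default locale once at startup; locale-less
        // APIs such as String.toLowerCase() and String.format() then behave
        // the same regardless of the host system's locale.
        Locale.setDefault(Locale.ROOT);
        System.out.println("INFO".toLowerCase()); // "info", even on a Turkish system
    }
}
```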
I have always wondered why Turkey chose to Latinize in this way. I understand that the issue is having two similar vowels in Turkish, but not why they decided to invent the dotless I, when other diacritics already existed. Ĭ Î Ï Í Ì Į Ĩ and almost certainly a dozen other would've worked, unless there was already some significance to the dot in Turkish that's not obvious.
Computers and localisation weren't relevant back in the early 20th century. The dotless i existed before the dotted i (in Greek script, as iota). Some European scholars putting an extra dot on the letter to make it stand out a bit more are as much to blame as the Turks for making the distinction between the different i-vowels clear.
Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.
> that not everybody writes in English.
I don't know... I understand the history and reasons for this capitalization behavior in Turkish, and my native language isn't English either; it needed a lot of strange encodings before the introduction of UTF-8.
But messing around with the capitalization of ASCII (code points <= 127) is a risky business, in my opinion. These code points are explicitly named:
"LATIN CAPITAL LETTER I" "LATIN SMALL LETTER I"
and requiring them not to map to each other exactly during uppercasing/lowercasing sounds very risky.
It's not exactly programmers failing to take into account that not everybody writes in English - if that were the case, then it would simply be impossible to represent the Turkish lowercase-dotless and uppercase-dotted I at all. The actual problem is failing to take into account that operations on text strings that work in one language's writing system might not work the same way in a different language's writing system. There's a lot of languages in the world that use the Latin writing system, and even if you are personally a fluent speaker and writer of several of them, you might simply have not learned about Turkish's specific behavior with I.
> Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.
This bug is the exact opposite of that. The program would have worked fine had it used pure ASCII transforms (±0x20); it was the use of library functions that did in fact take Turkish into account that caused the problem.
More broadly, this is not an easy issue to solve. If a Turkish programmer writes code, what is the expected behaviour for metaprogramming and compilers? Are the function names in English or Turkish? What about variables, object members, struct fields? You could have one variable name that references some government ID number using its native Turkish name, right next to another variable name that uses the English "ID". How does the compiler know what locale to use for which symbol?
Boiling all of this down to 'just be more considerate' is not actually constructive or actionable.
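(A minimal sketch of such a pure ASCII transform in Java; the helper name is made up, and it assumes the inputs are known to be plain ASCII tags:)

```java
public class AsciiCase {
    // Lowercases only the ASCII range 'A'..'Z' (the ±0x20 trick); everything
    // else passes through untouched, so the locale cannot interfere.
    static String asciiToLower(String s) {
        char[] cs = s.toCharArray();
        for (int i = 0; i < cs.length; i++) {
            if (cs[i] >= 'A' && cs[i] <= 'Z') {
                cs[i] += 0x20; // 'A' (0x41) -> 'a' (0x61), etc.
            }
        }
        return new String(cs);
    }

    public static void main(String[] args) {
        System.out.println(asciiToLower("WARNING")); // "warning" in every locale
    }
}
```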
The issue is not the invention of the dotless I; it already existed. The issue is that they took a vowel pair, i/I, assigned the lowercase to one vowel and the uppercase to a different one, and invented whatever was left missing.
It's like they decided that the uppercase of "a" is "E" and the uppercase of "e" is "A".
This is misleading, because it assumes that i/I naturally represent one vowel, which is just not the case. i/I represents one vowel in _English_, when written with a Latin script. ~~In fact even this isn't correct, i/I represents one phoneme, not one vowel.~~ <see troad's comment for correction>
There is no reason to assume that the English representation is in general "correct", "standard", or even "first". The modern script for Turkish was adopted in the 1920s, so you could argue perhaps that most typewriters presented a standard that should have been followed. However, there was variation even between different typewriters, and I strongly suspect that typewriters weren't common in Turkey when the change was made.
> In fact even this isn't correct, i/I represents one phoneme, not one vowel.
Not quite. In English, 'i' and 'I' are two allographs of one grapheme, corresponding to many phonemes, based on context. (Using linguistic definitions here, not compsci ones.) The 'i's in 'kit' and 'kite' stand for different phonemes, for example.
> There is no reason to assume that the English representation is in general "correct", "standard", or even "first".
Correct, but the I/i allography is not exclusive to English. Every Latin script functions that way, other than Turkish and Turkish-derived scripts.
No one is saying Turkish cannot break from that convention - they can feel free to do anything they like - but the resulting issues are fairly predictable, and their adverse effects fall mainly on Turkish speakers in practice, not on the rest of us.
> but the resulting issues are fairly predictable, and their adverse effects fall mainly on Turkish speakers in practice, not on the rest of us.
I don't think it's fair to call it predictable. When this convention was chosen, the problem of "what is the uppercase of i" was always bound to the context of a language. Now it suddenly isn't. Shikata ga nai ("it can't be helped"). It wasn't even an explicit assumption that could be reflected upon; it was an implicit one that just happened.
> Not quite. In English, 'i' and 'I' are two allographs of one grapheme, corresponding to many phonemes, based on context. (Using linguistic definitions here, not compsci ones.) The 'i's in 'kit' and 'kite' stand for different phonemes, for example.
You're right; apologies, my linguistics is rusty and I was overconfident.
> Correct, but the I/i allography is not exclusive to English. Every Latin script functions that way, other than Turkish and Turkish-derived scripts.
I think my main argument is that the importance of standardizing on i/I was much less obvious in the 1920s. The benefits are obvious to us now, but I think we would be hard pressed to predict this outcome a priori.
>This is misleading, because it assumes that i/I naturally represent one vowel, which is just not the case.
It does in literally any language using a Latin alphabet other than Turkish.
This may be correct; I'd have to do a 'real' search, which I'm too lazy to do, lol sorry. However, there are definitely other (non-Latin) scripts that have either i or I, but for which i/I is not a correct pair. For example, Greek has ι/Ι too.
All other Turkic languages also copied this for their Latin script: https://en.wikipedia.org/wiki/Dotless_I
Nope, we decided to do it the correct and logical way for our alphabet. Some glyphs are either dotted or dotless. So, we have Iı, İi, Oo, Öö, Uu, Üü, Cc, Çç, Ss and Şş. You see the Ii pair is actually the odd one in the series.
Also, we don't have serifs in our I. It's just a straight line. So, it's not even related to your Ii pair in English. You can't dictate how we write our straight lines, can you?
The root cause of the problem is in the implementation and standardization of computer systems. Computers were originally designed with only the English alphabet in mind, and patched to support other languages over time, poorly. Computers should obey the language rules, not the other way around.
>Also, we don't have serifs in our I.
That depends on the font.
>So, it's not even related to your Ii pair in English.
Modern Turkish uses the Latin script, of course it's related.
>You can't dictate how we write our straight lines, can you?
No, I can't, I just want to understand why the Turks decided to change this letter, and this letter only, from the rest of the standard Latin script/diacritics.
> Computers were originally designed with only the English alphabet in mind.
Computers were originally designed for no alphabet at all. They only have two symbols.
ASCII is a set of operating codes that includes instructions to physically move different parts of a mechanical teleprinter. It was already a mistake when it was used for computer displays.
Note that ASCII stands for "American Standard Code for Information Interchange". There's no expectation here that this is a suitable code for any language other than English, the de-facto language of the United States of America.
Does the situation change in Unicode?
I don’t think that’s the right way to think about it. It’s not like they were Latinizing Turkish with ASCII in mind. They wanted a one-to-one mapping between letters and sounds. The dot versus no dot marks where in your mouth or throat the vowel is formed. They didn’t have this concept that capital I automatically pairs with lowercase i. The dot was always part of the letter itself. The reform wasn’t trying to fit existing Western conventions, it was trying to map the Turkish sounds to symbols.
They switched from the Arabic script to the Latin script. They literally did Latinize Turkish, but they ditched the convention of a 1-to-1 correspondence between lowercase and uppercase letters that is invariant across all languages that use the Latin script, except for German, Turkish, and Turkish's offspring Azerbaijani.
Not really. Turkish has a feature that is called "vowel harmony". You match suffixes you add to a word based on a category system: low pitch vs high pitch vowels where a,ı,o,u are low pitch and e,i,ö,ü are high pitch.
Ö and ü were already borrowed from the German alphabet. The umlaut-added variants 'ö' and 'ü' have a similar effect on 'o' and 'u' respectively: they bring a back vowel to the front. See: https://en.wikipedia.org/wiki/Vowel . Similarly, removing the dots brings them back.
Turkish already had the i sound and its back variant, which is a schwa-like sound: https://en.wikipedia.org/wiki/Close_back_unrounded_vowel . It has the same relation in the IPA as 'ö' has to 'o' and 'ü' has to 'u'. Since the makers of the Turkish variant of the Latin alphabet had the rare chance of designing a regular pronunciation system for the language, and since removing the dots had the effect of turning a front vowel into a back vowel, they simply copied this feature from ö and ü to i:
Just remove the dots to make it a back vowel! Now we have ı.
When it comes to capitalization, ö becomes Ö and ü becomes Ü. So it is just logical to make the capital of i İ, and the lowercase of I ı.
Yes, it's hard to come up with a different capital than I unless you can somehow see into the future and foresee the advent of computers, which the Turkish alphabet reform predates.
Of course the Latin capital I is dotless, because originally the lowercase Latin "i" was also dotless. The dot was added later to make text more legible.
> low pitch vs high pitch vowels where a,ı,o,u are low pitch and e,i,ö,ü are high pitch.
Does that reflect the Turkish terminology? Ordinarily you would call o and u "high" while a and e are "low". The distinction between o/u and ö/ü is the other dimension: o/u are "back" while ö/ü are "front".
> Does that reflect the Turkish terminology?
Yes. The Turkish terms are "kalın ünlü" and "ince ünlü". They literally translate to "thick vowel"/"thin vowel", or "low pitch vowel"/"high pitch vowel" in this context.
There is a second vowel harmony rule [1] (called the lesser vowel harmony) that makes the distinction you pointed out. The letters a/e/ı/i are called flat vowels, and o/ö/u/ü are called round vowels.
[1] https://georgiasomethingyouknowwhatever.wordpress.com/2015/0...
There were actually three! i (as in th[i]s), î (as in ch[ee]se) and ı, which sounds nothing like the first two; it sounds something like the e in bag[e]l. I guess it sounded so different that it warranted such a drastic symbolic change.
Turkish exhibits a vowel harmony system and uses diacritics on other vowels too, and the choice to put "i" together with other front vowels like "ü" and "ö", and "ı" together with back vowels like "u" and "o", is actually pretty elegant.
The Latinization reform of the Turkish language predates computers, and it was hard to foresee the woes that future generations would have with that choice.
Except for the a/e pair, front and back vowels have dotted and dotless versions in Turkish: ı and i, o and ö, u and ü.
Makes sense enough, but why not use i and ï to be consistent?
Turkish i/İ sounds pretty much the same as in most European languages; Italian, French and German pronounce it pretty similarly. Also, removing the umlauts from the other two vowels, ö and ü, to write o and u has the same effect as removing the dot from i. It is just consistent.
No, what I mean is, o and u get an umlaut (two dots) to become ö and ü, but i doesn't get an umlaut, it's just a single dot from ı to i. Why not make it i and ï? That would be more consistent, in my opinion.
I guess the aim was to reuse as much of the standard Latin alphabet as possible.
A better solution would have been to leave i/I as they are (similar to j/J), and introduce a new lowercase/uppercase letter pair for "ı", such as Iota (ɩ/Ɩ).
This was shortly after the Turkish War of Independence. Illiteracy was quite high (estimated at over 85%) and the country was still being rebuilt. My guess is they did their best to represent all the sounds while creating a one to one mapping between sounds and letters but also not deviating too much from familiar forms. There were probably conflicting goals so inconsistencies were bound to happen.
In that case they should've used ï for consistency.
That would be the opposite of consistency; i is the front vowel and ı is the back one.
Note that the vowel /i/ cannot umlaut, because it's already a front vowel. The ï you cite comes from French, where the two dots represent diaeresis rather than umlaut. When umlaut is a feature of your language, combining the notation like that isn't likely to be a good idea.
It’s always Turkish lol. That was our language of choice to QA anything… if it worked on that it would pretty much work on anything.
I'm shocked there's no mention of "The Turkey Test"
https://blog.codinghorror.com/whats-wrong-with-turkey/
I was scrolling and scrolling, waiting for the author to mention the new methods, which of course every Android Dev had to migrate to at some point. And 99% of us probably thought how annoying this change is, even though it probably reduced the number of bugs for Turkish users :)
Unrelated, but a month ago I found a weird behaviour where, in a Kotlin scratch file, `List.isEmpty()` is always true. Questioned my sanity for at least an hour there... https://youtrack.jetbrains.com/issue/KTIJ-35551/
well now I wanna know what's going on there!
Could have been worse --
Ramazan Çalçoban sent his estranged wife Emine the text message:
Zaten sen sıkışınca konuyu değiştiriyorsun.
"Anyhow, whenever you can't answer an argument, you change the subject."
Unfortunately, what she thought he wrote was:
Zaten sen sikişınce konuyu değiştiriyorsun.
"Anyhow, whenever they are fucking you, you change the subject."
This led to a fight in which the woman was stabbed and died and the man committed suicide in prison. https://gizmodo.com/a-cellphones-missing-dot-kills-two-peopl...
Wouldn't at least the first issue be solved by using Unicode case folding instead of lowercase? Python, for example, has separate .casefold() and .lower() methods, and AFAIK casefold would always turn I into i, and is much more appropriate for this use case.
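(Java's standard library has no direct equivalent of str.casefold(), but ICU4J exposes Unicode case folding; a sketch, assuming the com.ibm.icu:icu4j dependency is on the classpath:)

```java
import com.ibm.icu.lang.UCharacter;

public class FoldDemo {
    public static void main(String[] args) {
        // Default (non-Turkic) case folding maps 'I' -> 'i' unconditionally,
        // with no locale involved -- the analogue of Python's str.casefold().
        System.out.println(UCharacter.foldCase("INFO", true)); // "info"
        // Unicode's alternative Turkic folding has to be opted into explicitly:
        System.out.println(UCharacter.foldCase("INFO",
                UCharacter.FOLD_CASE_EXCLUDE_SPECIAL_I)); // "ınfo"
    }
}
```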
Everyone who has used Java has hit this before. Java really should force people to always specify the locale and get rid of the versions of the functions without locale parameters. There is so much hidden broken code out there.
That only helps if devs specify an invariant locale (ROOT for Java) where needed. In practice, I think you'll see devs blindly using the user's current locale, like it silently does today.
The invariant locale can't parse the numbers I enter (my locale uses comma as a decimal separator). More than a few applications will reject perfectly valid numbers. Intel's driver control panel was even so fucked up that I needed to change my locale to make it parse its own UI layout files.
Defaulting to ROOT makes a lot of sense for internal constants, like in the example in this article, but defaulting to ROOT for everything just exposes the problems that caused Sun to use the user locale by default in the first place.
Agreed, there are cases where the user locale is needed. So much so that I expect that to be devs’ default if they're required to specify one, and that they won’t use ROOT where they should.
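(One way to draw the line, sketched in Java: invariant casing for internal identifiers, and an explicitly requested user locale for display formatting.)

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

public class WhichLocale {
    public static void main(String[] args) {
        // Internal, machine-readable identifier: invariant casing.
        String key = "WARN".toLowerCase(Locale.ROOT);
        // User-facing output: the user's locale, requested explicitly
        // rather than picked up silently.
        String shown = LocalDate.now().format(
                DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                                 .withLocale(Locale.getDefault()));
        System.out.println(key + " / " + shown);
    }
}
```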
Kotlin keywords should be assumed to be English.
Logging levels are not language keywords.
Java: write once, run anywhere, except on Turkish Windows.
A stark reminder that all operations on strings are wrong.
Or that strings are not human texts.