▲Type-safe, K-sortable, globally unique identifier inspired by Stripe IDsgithub.com

418 points by dloreto 743 days ago | 221 comments

klabb3 743 days ago [-]

A couple of suggestions:

Lock down the prefix string now before it’s too late and document it. I see in Go that it’s lowercase ascii, which seems fine except for compound types (like “article-comment”). May be worth looking at allowing a single separator given that many complex projects (and ORMs) can’t avoid them.

The Go implementation has no tests. This is very unit-testable. Add tests goddammit!

For Go, I’d align with Googles UUID implementation, with proper parse functions and an internal byte array instead of strings. Strings are for rendering (and in your case, the prefix). Right now, it looks like the parsing is too permissive, and goes into generation mode if the suffix is empty. And the SplitN+index thing will panic if no underscores, no? Anyway, tests will tell.

As for the actual design decisions, I tried to poke holes but I fold! I think this strikes the sweet spot between the different tradeoffs. Well done!

dloreto 743 days ago [-]

Thanks for the feedback!

We have tests for the base32 encoding which is the most complicated part of the implementation (https://github.com/jetpack-io/typeid-go/blob/main/base32/bas...) but your point stands. We'll add a more rigorous test suite (particularly as the number of implementations across different languages grows, and we want to make sure all the implementations are compatible with each other)

Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

dolmen 743 days ago [-]

There is no tests.

There is just a single test. Which only tests the decoding of a single known value. No encoding test.

Go has infrastructure for benchmarking and fuzzing. Use it!

dloreto 742 days ago [-]

We've now implemented pretty thorough testing: https://github.com/jetpack-io/typeid-go/blob/main/typeid_tes...

Thanks for the feedback!

klabb3 743 days ago [-]

> We have tests for the base32 encoding which is the most complicated part of the implementation

I didn’t look into it much but it seems like a great encoding even outside of this project. Predictable length, reasonable density, “double clickable” etc. I’ve been annoyed with both hex and base64 for a while so it’s pretty cool just by itself.

> Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

Yeah, the worry is almost entirely “subtle deviations across stacks”, which is usually due to ambiguous specs. It’s so annoying when there’s minor differences, compatibility options etc (like base64 which has another “URL-friendly” encoding - ugh).

aftbit 743 days ago [-]

My personal favorite encoding is base58 aka Bitcoin address encoding. It uses characters [A-Za-z0-9] except for [0OIl]. It is almost as dense as base64, "double clickable", but not (as) predictable in length as base32.

It was chosen to avoid a number of the most annoying ambiguous letter shapes for hand-entry of long address strings.

https://en.bitcoin.it/wiki/Base58Check_encoding

bredren 743 days ago [-]

Reminds me that Windows activation keys used to exclude a broader set of characters to avoid transcription errors: looking it up again: 0OI1 and 5AELNSUZ

unshavedyak 743 days ago [-]

Is there a good reason why Base63 (nopad) doesn't exist? Ie Base64 minus the `-`, so that you almost get the density of base64 (nopad) but the double click friendly feature.

I was reviewing encodings recently and didn't want to drop all the way down to base32, but for some reason the library i was using didn't allow anything beyond base32 and bas64 variants, despite having a feature where you can define your own base.

I thought maybe it was performance oriented. An odd prefix length like base63 would mean .. i think, a slightly more computationally demanding set of encoding instructions?

Either way i basically want base58 but i don't care about legibility, i just wanted double click and url friendly characters.

fauigerzigerk 743 days ago [-]

>Is there a good reason why Base63 (nopad) doesn't exist? Ie Base64 minus the `-`, so that you almost get the density of base64 (nopad) but the double click friendly feature.

Yes, the reason is that you need 64 characters if you want each character to encode 6 bits as log2(64) == 6. If you only have 63 characters in your alphabet then one of your 6-bit combinations has no character to represent it.

Base32 can represent 5 bits per character because log2(32) == 5. Anything in between 32 and 64 doesn't buy you anything because there is no integer between 5 and 6.

unshavedyak 742 days ago [-]

Is that "just" a performance concern though? Ie why is there a base58 and base62 but no base63?

Now you've got me curious on the performance of base58 to base64 hah. Down the rabbit hole i go. Appreciate your reply, thanks :)

TedDoesntTalk 743 days ago [-]

It’s not too difficult to write your own encoding. Probably 10 lines of code or less if you hard-code your encoding alphabet.

kbumsik 742 days ago [-]

What does “double clickable” mean?

d0mine 742 days ago [-]

Whether "double click" selects the whole id.

kbumsik 743 days ago [-]

> Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

It would be great if you add suggestions for compound types (like “article-comment”) in README as OP stated as well.

spoiler 743 days ago [-]

It seems that's not allowed currently, if I'm reading it right. I'm not sure I like `-` very much. The reason why I don't like it is because of how double-click to select and line breaking works for the dash. Maybe allowing `_` in the typename, and the have the rightmost `_` serve as the separator might be more consistent.

But also, I'm bike-shedding and its only an ID

kbumsik 742 days ago [-]

I like using "." for this case. Because types definition typically belong to a package or module, which commonly uses "." for separator.

avgcorrection 743 days ago [-]

> The Go implementation has no tests. This is very unit-testable. Add tests goddammit!

Yep. The readme asks people to provide other implementations. Having a test suite would be good for third-party code.

dloreto 742 days ago [-]

A follow up:

1. We've now implemented pretty thorough testing: https://github.com/jetpack-io/typeid-go/blob/main/typeid_tes...

2. I clarified the prefix in the spec

Thanks for the feedback!

rtheunissen 743 days ago [-]

I'll write some tests for this.

jhoechtl 743 days ago [-]

I think your misconception is that prefix is sthg. fixed? You decided on the prefix depending on the usage domain.

tomcam 743 days ago [-]

> Add tests goddammit!

Hey, you’re pretty smart. How about you add them?

klabb3 743 days ago [-]

I’m by no means a test police. I’m in fact opposed to a lot of mindless testing for the sake of it. But there are places where unit tests shine, and this is one of them.

If you mean that criticism is only allowed if you are willing to commit labor, I disagree with that. I always welcome critique myself - it may be something that I’ve missed. The maintainers always has the last word. As long as there are no hidden expectations, it’s all good.

tomcam 742 days ago [-]

Do you need to be “test police“ to add tests you literally demanded of them?

aartav 743 days ago [-]

I've been doing this kind of thing for years with two notable differences:

1. I don't believe people actually hand type-in these values, so I'm not really concerned about the 'l' vs '1' issue. I do base 32 without `eiou` (vowels) to reduce the likelihood of words (profanity) sneaking in.

2. I add two base-32 characters as a checksum (salted of course). This is prevents having to go look at the datastore when the value is bogus either by accident or malice. I'm unsure why other implementations don't do this.

sokoloff 743 days ago [-]

> base 32 without `eiou` (vowels) to reduce the likelihood of words (profanity) sneaking in.

We had “analrita” as an autogenerated password that resulted in a complaint many years ago. Might consider adding ‘a’ as an excluded letter.

michaelt 743 days ago [-]

Presumably base 32 means 26 letters + 10 digits - 4 banned letters

So adding an excluded letter is not easy.

sokoloff 743 days ago [-]

Why not use base-31 and (optionally) more characters? (Or go upper and lower or add a symbol if you had to stay with a fixed-size and base-32 for some reason)

chipsa 743 days ago [-]

Because Base32 is just bit shifting and then converting the 5 bits into a char, and vice versa. Doing Base31 requires base conversion.

manquer 743 days ago [-]

Wouldn’t that be excluded because i is already removed ?

jhgg 743 days ago [-]

I don't think they took offense to the "rita" part.

bombcar 743 days ago [-]

analratassart would probably be just as bad

EGreg 743 days ago [-]

ianal

spoiler 743 days ago [-]

I realise you posted this as a joke, but the first time I saw this, I was so confused. I thought the comment was starting with "I anal" before I read the rest of the post only to compute it means "I Am Not A Lawyer"

rejiswe 742 days ago [-]

[flagged]

743 days ago [-]

tlrobinson 743 days ago [-]

I agree with the addition of the checksum, however I’m curious:

> either by accident or malice

1. if you don’t believe people hand type these then how else will they accidentally enter an invalid? I suppose copy/paste errors, or if a page renders it as uppercase, though you should just normalize the case if it’s base 32.

2. How does a 2 byte (non-cryptographically secure) checksum help in the case of malice?

dloreto 743 days ago [-]

The checksum idea is interesting. I'm considering whether it makes sense to add it as part of the TypeID spec.

veec_cas_tant 743 days ago [-]

What value does the checksum provide? I think I'm missing something because I really don't see a benefit.

diroussel 739 days ago [-]

The benefit is that you can reject bad requests to an API more easily.

For one application I used a base 58 encoded value. Part of it was a truncated hmac, which I used like check digits. This meant I could validate IDs before hitting the DB. As an attacker or script kiddie could otherwise try a resource exhaustion attack.

So in the age of public internet faceing APIs and app urls, I think built in optional check digit support is a good idea.

veec_cas_tant 739 days ago [-]

I struggle to see how 10 bits of check data will help much. I guess if the extra bits aren’t persisted to storage it doesn’t hurt so why not?

diroussel 738 days ago [-]

Storage can get corrupted, columns can be truncated. For the applications I tend to touch correctness and the ability to detect errors and tamper are more important that a couple of bytes per row. But every application and domain is different.

bumbledraven 743 days ago [-]

Checksums facilitate error detection. For typed UUIDs, checksums help detect errors introduced by changing the prefix/type or changing a “digit”.

zrail 743 days ago [-]

I implemented number two as part of an encoding scheme a few months ago. I'm not sure how much it's saved in terms of database lookups but it's aesthetically pleasing to know it won't hit a more inscrutable error while trying to decode.

ajkjk 743 days ago [-]

Unrelated, but this links to "Crockford's alphabet", https://www.crockford.com/base32.html , which is a base-32 system that includes all alphanumeric characters except I and L (which are confusable with 1), O (which is confusable with 0), and U (????). The page says the reason for excluding U is "accidental obscenity'. What the heck is it talking about?

kibwen 743 days ago [-]

> The page says the reason for excluding U is "accidental obscenity'.

Crockford is being cheeky. To make a nice base32 alphabet out of non-confusable alphanumeric characters you only need to exclude O, I, and L. This leaves you with 33 characters still, so you need to remove one more, and it doesn't matter which one you remove, so you might as well pick an arbitrary reason for the last character that gets removed (and it's not the worst reason, if your goal is to use these as user-readable IDs, although obviously it's not even remotely bulletproof).

pluijzer 743 days ago [-]

You could argue that U can be confused with V.

mtlmtlmtlmtl 743 days ago [-]

A vaguely related historical tangent is that V and U used to be just two ways of writing the same letter in Early Modern English. Which I imagine is why W is named as "double U" in speaking.

jabbany 743 days ago [-]

This is also interesting since in French (and I think Spanish?) W is (correctly) called "double V"

mtlmtlmtlmtl 743 days ago [-]

Same thing in Norwegian. But W is also not really a part of Norwegian orthography, it's just kind of there in the alphabet anyway. Only useful for names and maybe a couple loanwords.

namtab00 743 days ago [-]

In Romanian and Italian too.

TRiG_Ireland 743 days ago [-]

That is indeed why w is named "double u" in English (and "double v" in French. And why it's a vowel in Welsh.

See more from jan Misali: https://www.youtube.com/watch?v=sg2j7mZ9-2Y

EGreg 743 days ago [-]

Same with B and V, P and F, SS and Z, etc.

It is why PH is F, for instance.

dmurray 743 days ago [-]

5 and S seems more likely.

quickthrower2 743 days ago [-]

Fvck!

743 days ago [-]

dolmen 743 days ago [-]

This assumes that english is the only relevant language regarding curse words. Which is quite biased.

quickthrower2 743 days ago [-]

U is a fairly new letter anyway.

codeulike 743 days ago [-]

If I and O are already excluded and you also exclude U that removes a lot of potential rude looking three letter combinations like *** and *** and *** and also the four letter ones like **** and **** and the dreaded ****. Of course because you have A then **** is still a possibility but very very unlikely

titanomachy 743 days ago [-]

Wow I didn't know HN even had obscenity filters, and I've been here for many years.

Guess that's a credit to the general civility of the community.

EDIT: It appears that other people in this thread are freely using profanity, so either your comment was targeted by automation due to the unusual density of banned words, or it's a joke that went over my head :)

taberiand 743 days ago [-]

No obscenity filters, but there is a pretty good password filter I hear. For example, my password 'hunter2' will be all **** to you

9dev 743 days ago [-]

Isn’t it nice how some traditions do stick around. It’s been a while, Cthon98!

redler 743 days ago [-]

Sadly, this is one of those things that, soon, if not now, only the olds will know about.

astrospective 743 days ago [-]

Like telling people asking for game cheats to use Alt-F4.

rbera 743 days ago [-]

That explains it, I was very confused by what I assumed was self-censoring, since the comment didn’t actually clarify anything. I wish there was an accepted way to disambiguate asterisks from server side filters.

macintux 743 days ago [-]

I assumed this was a riff on the classic bash.org transcript.

http://www.bash.org/?244321

Aeolun 743 days ago [-]

But we know there are no server side filters here. At least I never had any profanity censored by anything other than the Mk 1 brain.

743 days ago [-]

AceJohnny2 743 days ago [-]

you accidentally the whole thing

EGreg 743 days ago [-]

I think the point

pavlov 743 days ago [-]

True Latinists find the letter U vulgar to the point of obscenity because it didn’t exist in Cicero’s time.

oleganza 743 days ago [-]

Trve Latinists wovld appreciate yovr point.

littlestymaar 743 days ago [-]

Gotcha, there was no “W” in the Latin alphabet either ;)

Zamicol 743 days ago [-]

There's more!

- base 58 - Satoshi's/Bitcoin's https://en.wikipedia.org/wiki/Binary-to-text_encoding#Base58

- "base62" - Keybase's saltpack https://github.com/keybase/saltpack

- The famous "Adobe 85" - https://en.wikipedia.org/wiki/Ascii85

- basE91 - https://base91.sourceforge.net

At work we defined several new "bases" for QR code. IMHO, it is an under applied area of computer science.

hinkley 743 days ago [-]

A coworker and I came up with basically this same set about 4 years before Crockford. We were trying to solve the url slug problem, and they were long enough that we felt 5 bits per byte would reduce transcription annoyances.

In the end I think we had a couple of characters to spare, and so, sitting by ourselves because everyone else had gone home for the day, we ranked swear words by how offensive they were to prioritize removal of a few extra letters. Then I convinced him that slurs were a bigger problem so we focused on that, which got rid of the letter n, instead of u

tggr is just cute, n**r is an uncomfortable conversation with multiple HR teams (we were B2B)

I'm a bit fuzzy now on what our ultimate character set was, because typically you're talking [a-z][0-9], an there are a lot of symbols you can't use in urls and some that are difficult to dictate. My recollection is that we eliminated both 0, l, and 1, but I think we relied on transcription happening either from all caps or all lowercase. 0o are not a problem. Nor are 1L.

hinkley 743 days ago [-]

Other comments are jogging my memory. I think we went case sensitive (62 characters -> 30 spares), eliminated aA4, eE3, iI1l oO0 (maybe Q), uU, which is 16 characters, 14 to go. Remove the remaining 7 numbers (once you remove most for leetspeak what's the point of the rest?), nN, yY. That leaves 2 left and I can't recall what we did with those. Maybe kK or rR.

Y is pretty versatile for pissing people off.

theptip 743 days ago [-]

The Scunthorpe problem?

https://en.m.wikipedia.org/wiki/Scunthorpe_problem

pizzapill 743 days ago [-]

E-Mail accounts seem the worst. Just lets write letters again, if you need a pencil I recommend penisland.net

TedDoesntTalk 743 days ago [-]

Penis mean “tail” in Latin.

pizzapill 743 days ago [-]

As they say Your pen is Our business!

programmarchy 743 days ago [-]

There's enough comedic content in this article for several Silicon Valley episodes.

jszymborski 743 days ago [-]

FUCK

Racing0461 743 days ago [-]

yep, youtube video ids has/had? same issue where it would have things like fag/f4g etc in it.

eg: google "allinurl:fag site:youtube.com"

stronglikedan 743 days ago [-]

You can prevent any obscenity, O and 0 confusion, and I and L confusion, just by excluding vowels. If someone interprets "f4g" in an offensive way, then they have bigger issues than can be dealt with in software.

MR4D 743 days ago [-]

Obscenities change with language. That’s why every language has them.

Even programming language have them. For instance, Basic has GOTO.

ZeroClickOk 743 days ago [-]

and javascript has type coercion

arcticbull 743 days ago [-]

There was that time Delta generated an "H8GAYS" PNR. [1] Pretty sure that's valid Crockford encoding too :) however, to your point, it does rely on 'A'. "H8G4YS" would likely still offend someone out there, though, given the kerfuffle in [1].

[1] https://newsfeed.time.com/2013/12/17/delta-airlines-is-very-...

kortex 743 days ago [-]

> But as White points out, it’s a bit surprising that Delta didn’t block this particular combination as a possibility. “I’m sure they removed many four-letter words that would be seen as offensive,” he tells the Post. “I’m surprised that ‘gays’ and ‘H8’ weren’t blocked as well.”

Oh sweet summer child (meaning Jeff White, not OP/GP). As someone who has implemented a censorship/filtering list, this is a UX problem on the same level of decideability as the halting problem. You can spend boundless time curating a list to flag/grawlix every possible string that would offend even the most prudish of prudes, and some would still get through. Such as the superficially benign "EATTHE"

https://www.dailymail.co.uk/news/article-2039662/Virginia-dr...

taosx 743 days ago [-]

Why we care about obscenity in pseudo-random ids and url?

whimsicalism 743 days ago [-]

anglo morals

Racing0461 743 days ago [-]

i don't, the people that can start a **storm on twitter/tumblr causing your stock to drop 10% does.

oleganza 743 days ago [-]

The problem is not being offended per se, but having your user id accidentally become "user_123fuck567" — that's akin to having a vulgar license plate on your car's forehead. People don't appreciate how lucky they sometimes are.

nephanth 743 days ago [-]

Are those supposed to be user- facing though?

lolinder 743 days ago [-]

Using Stripe-style prefixing is most useful if they are—it makes interacting with customer support easier because everyone is on the same page about what this ID identifies.

avgcorrection 743 days ago [-]

> The page says the reason for excluding U is "accidental obscenity'. What the heck is it talking about?

Because he’s an American?

deanmen 743 days ago [-]

The F word has a U in it. Sure you could just say FVCK

bongobingo1 743 days ago [-]

Or Fwck if you doubly mean it.

743 days ago [-]

programmarchy 743 days ago [-]

Yeah, wtf?

atonse 743 days ago [-]

Obviously by “wtf” you must mean “why the face?” Right? Right?? :-)

inopinatus 743 days ago [-]

I'm not wild about the Crockford encoding. In practice I've found it to be a flat-out mistake when you come to provide technical support or analysis for values encoded this way. The Crockford alphabet is based on design goals that are rarely encountered in practice, such as pronouncing identifiers over the phone. It introduces ambiguity, which is a disaster for grepping logs or any other circumstances where you might query or cross-reference based on the encoded string instead of the decoded value, then permits hyphens, a leading source of cut-and-paste and line-break errors.

Note that people generally do not type in object identifiers, but they do frequently cut-and-paste them between applications and chat/forum interfaces, forward them by email, search for them in log files. Verbal transmission is rare to non-existent. Under these conditions, pronunciation proves irrelevant, and case-insensitivity becomes an impediment, but consistency and paste/break resilience become necessary.

Base 58 offers a bijective encoding that fits these concerns much more effectively and is more compact to boot. Similarly inspired by Stripe, I've been using type-prefixed base58-encoded UUIDs for object identifiers for some years. user_1BzGURpnHGn6oNru84B3Ri etc.

Edit to add: to be fair to Douglas Crockford, his encoding of base 32 was designed two decades ago, when the usage landscape looked quite different.

dloreto 743 days ago [-]

I hear you ... and I debated using either base58 or base64url. I do like the more compact encoding they provide.

Ultimately I ended up leaning towards a base32 encoding, because I didn't want to pre-suppose case sensitivity. For example, you might want to use the id as a filename, and you might be in an environment where you're stuck with a case insensitive filesystem.

Note that TypeID is using the Crockford alphabet and always in lowercase – *not* the full rules of Crockford's encoding. There's no hyphens allowed in TypeIDs, nor multiple encodings of the same ID with different variations of the ambiguous characters.

742 days ago [-]

ash 743 days ago [-]

I agree that pronouncing identifiers over the phone is rare. But I’m occasionally typing identifiers from:

1. a screenshot or a screen share that contains an identifier

2. another device where I can’t easily take an identifier

inopinatus 743 days ago [-]

That's fair. From experience I think the most common problem with screenshots is [0O] and [Il] ambiguity. As a point of comparison I'm willing to suggest that both base58 and crockford32 handle the matter reasonably, albeit differently, through their omitted-characters and decoding tables.

One feature I do like from crockford32, that base58 lacks, and which also assists transcription from noisy sources, is the check symbol. So much that it is quite unfortunate that this check symbol is optional. In 2023 it's hard to fight the urge to specify a mandatory emoji to encode a check value (caveat engineer: this is not actually a good idea :))

Lazare 743 days ago [-]

I agree; base58 or base62 (which KSUIDs use) have a lot to recommend them. Crockford's base32 works, but I don't love it.

My first choice would be to just use type-prefixed KSUIDs, which gives you 160-bit K-sortable IDs with base62 encoding, which works great unless you need 128-bit IDs for compatability reasons.

yencabulator 742 days ago [-]

Wait, where's the hyphen in Crockford Base32? https://en.wikipedia.org/wiki/Base32#Crockford's_Base32

My favorite base-32 encoding is z-base-32, which I find just gentler on the eyes: https://philzimmermann.com/docs/human-oriented-base-32-encod...

The biggest problems with base58 are 1) it works for integers, less so for arbitrary binary data like crypto keys 2) case-sensitivity ISnOtNIcEtOLoOKaT (in my opinion).

inopinatus 742 days ago [-]

The specification of crockford32 is at https://www.crockford.com/base32.html

z-base32 has some nice ideas, although I don’t really give a damn how these things look except where that has functional/ergonomic consequences, since none of them have real aesthetic value. The beauty of numbers is in their structural properties, not their representations. If we really cared about how it feels I’d suggest using an S/KEY-style word mapping instead to get some poetry out of it.

yencabulator 742 days ago [-]

Ah, "Hyphens (-) can be inserted into symbol strings." If you don't like 'em, don't use 'em.

And the only point of these encodings is to be more human-palatable than base64. If you take that goal out of the equation, just use base64, it's better (denser, more readily supported everywhere). Various base32's and base58 exist because base64 was not human friendly enough.

742 days ago [-]

stephen 743 days ago [-]

Neat! Love the "type-safe" prefix; we'd called them "tagged ids" in our ORM that auto-prefixes the otherwise-ints-in-the-db with similar per-entity tags:

https://joist-orm.io/docs/advanced/tagged-ids

We'd used `:` as our delimiter, but kinda regretting not using `_` because of the "double-click to copy/paste" aspect...

In theory it'd be really easy to get Joist to take "uuid columns in the db" and turn them into "typeids in the domain model", but probably not something that could be configured/done via userland atm...that'd be a good idea though.

wongarsu 743 days ago [-]

Reddit does something similar, but optimized for string length: elements have ids like "t3_15bfi0" where t3_ is a prefix for the type (t3 is a post, t1 a comment, t5 a subreddit, etc) and the remaining is a base36 encoding of the autoincrementing primary key.

stephen 743 days ago [-]

Nice!

The `t<X>` makes sense; we currently guess a tag name of "FooBarZaz" --> "fbz", but allow the user to override it, so you could hand-assign "t1", "t2", etc. as you added entities to the system.

Abbreviating/base36-ing even the auto-incremented numeric primary key to optimize lengths is neat; we haven't gotten to the size of ids where that is a concern, but it sounds like a luxurious problem to have! :-)

hamburglar 743 days ago [-]

My company has a typed internal ID system that originally used colons as delimiters but we quickly switched to dots (.) as the delimiter because it’s very annoying to have url-encoded IDs balloon in size because colons need to be %-encoded. Makes your urls ugly and long.

wood_spirit 743 days ago [-]

UUIDv7 has been taking HN by storm for years now! When is it going to become a proper standard, and when are libraries and databases and all the rest going to natively support it?

vbezhenar 743 days ago [-]

What kind of support do you expect? I'm pretty sure that absolute majority of software does not care about any particular bits in UUID, so you can use it today. If some software cared about any particular bits, just imitate UUIDv4, I mean those bits could be randomly generated as well. If you need generation procedure, write it yourself, it's easy.

dolmen 743 days ago [-]

IDs generation is usually private to a company scope and rarely need to be "universally unique".

kijeda 743 days ago [-]

It would appear to be in the final stages of standardization in the IETF: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122b...

Daegalus 743 days ago [-]

Its been going through drafts and improvements. It's very close to being standardized, and many libraries are supporting it already, or new offerings are being added. For example I maintain the Dart UUID library, and my latest beta major release has v6, v7 and a custom v8. There is a list of them somewhere, I know I get pinged on every new draft by the authors because I am listed as a library maintainer on one of their pages.

Nelkins 743 days ago [-]

How much does it change between drafts? Close enough to where I could use it in production?

Daegalus 743 days ago [-]

Seeing as how its nearly done, it doesn't change much. It changed more often in the beginning, but its like on its final draft, or near final draft. I think the IETF plans to make final soon.

eezing 743 days ago [-]

“…can be selected for copy-pasting by double-clicking”

Details matter.

bombela 743 days ago [-]

I have some complaints about UUIDs. Why not just combining time + random number without the ceremony of UUID versioning. And for when locality doesn't matter, just use a 128bit random number directly.

And in my experience most people somehow think a UUID must be stored into the human friendly hex representation, dashes included. Wasting so much space in database, network, memory.

rjh29 743 days ago [-]

Many people had the same idea. For example ULID https://github.com/ulid/spec is more compact and stores the time so it is lexically ordered.

jerf 743 days ago [-]

While this isn't the worst area I see this in, there does seem to be a tendency in the UUID space to speak as if one use case stands for all and therefore there is a best UUID format.

The reality is that it is just like any other engineering situation. Sit down, write down your requirements, and see what, if anything, solves it.

Reading about the advantages of various formats is very helpful in helping you skip learning about certain things the hard way and use somebody else's experience of learning them the hard way instead. From that point of view I recommend at least glancing through them all. Sortability and time-based locality is one that you may not naturally think about, and if you need it, you will appreciate not learning that the hard way four years into a project after you threw that data away and then realizing you needed it. And some UUID formats actually managed to introduce small security issues into themselves (thinking MAC address leak from UUID v1 here), nice to avoid those too.

If you have a use case where there's an existing solution then, hey, great, go ahead and use it. Maybe if anyone ever needs that but in another language they can pull a library there too.

But if not, don't sweat it. The biggest use of UUIDs I personally have I specified as "just send me a unique string, use a UUID library of your choice if it makes you feel better". I think I've got a unique format per source of data in this system and it's fine. I don't have volume problems, it's tens of thousands of things per day. I don't have any need to sort on the UUID, they're not really the "identifier", they're just a unique token generated for a particular message by the originator of the message so we can detect duplicate arrivals downstream in a heterogenous system where I can't just defer that task to the queue itself since we have multiple. I don't even need them to be globally unique, I just need them unique within a rather small shard, and in principle I wouldn't even mind if they get repeated after a certain amount of time (though I left the system enforcing across all time anyhow for simplicity). In this particular case, I do indeed generate my own UUIDs for the stuff I'm originating by just grabbing some stuff from /dev/urandom and encoding it base64, with a size selected such that base64 doesn't end the encoding with ==. Even that's just for aesthetic's sake rather than any actual problem it would cause.

stronglikedan 743 days ago [-]

> combining time + random number

You can't guarantee that this will be globally unique.

ceejayoz 743 days ago [-]

No identifier can guarantee that. We just get close enough to be acceptable.

Per Wikipedia, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.

so-youre-saying-theres-a-chance.gif

jandrewrogers 743 days ago [-]

I have single datasets with trillions of UUID. Collision probability becomes a thing.

That aside, UUIDv4 is banned in many orgs because there have been several instances in the wild where the “random” number wasn’t nearly as random as advertised from some sources for a variety of reasons, leading to collisions. It is relatively easy to screw this up so many orgs don’t risk it.

vore 743 days ago [-]

A 1 in a billion collision probability?

743 days ago [-]

duped 743 days ago [-]

A billion is not that big of a number for UUIDs

ceejayoz 743 days ago [-]

Re-read. You'd have to generate 103 trillion to have a one billionth chance of a collision.

A billion isn't that big a number, but 103 trillion is.

jandrewrogers 743 days ago [-]

I think you made a mistake in your math. The Birthday Collision probability of just a trillion random UUID is much higher than that.

ceejayoz 743 days ago [-]

Feel free to update https://en.wikipedia.org/wiki/Universally_unique_identifier#..., but it does note "This probability can be computed precisely based on analysis of the birthday problem". It does show the formula used.

deathanatos 743 days ago [-]

Wikipedia is correct, AFAICT.

The probability of 1 trillion UUIDs having a collision is,

  def birthday_collision(n, m):
      return 1 - math.e ** (-((n -1) * n) / (2 * m))

  In : birthday_collision(1_000_000_000_000, 2 ** 122)
  Out: 9.403589018575076e-14

That number is roughly the approximation given in Wikipedia.

I.e., at 1T UUIDs, it hasn't happened. For comparison, the odds of being struck by lighting (over a lifetime) is many orders of magnitude greater:

  6.535947712418301e-05

hot_gril 741 days ago [-]

The only worthwhile UUID standard IMO is v4 (simple random), and I still don't get why it needs dashes. The other ones don't really accomplish anything.

deepsun 743 days ago [-]

The worst thing about dashes is you cannot easily double-click it whole to copy-paste.

Xeoncross 743 days ago [-]

Assuming you don't need to use UUIDv7 (or any UUID's) then https://github.com/segmentio/ksuid provides a much bigger keyspace. You could just append a string prefix if you wanted to namespace, but the chance of collisions of a ksuid is many times smaller than a UUID of any version.

ksuid is the best general purpose id generator with sort-able timestamps I've found and has libraries in most languages. UUID v1-7 are wasteful.

sophiabits 743 days ago [-]

I have a thin JavaScript package published which generates type-prefixed KSUIDs. KSUID is a great format assuming you can spare the extra bits (which is most people)

[1] https://github.com/sophiabits/resource-id

rtheunissen 743 days ago [-]

I moved to ULID because they are always lowercase and therefore case-insensitive.

Lazare 743 days ago [-]

ULIDs work, but:

The spec has a really weird choice in it - it attempts to guarantee monotonicity. There's a few problems with that, namely that it's almost entirely pointless, it's impossible to actually guarantee it, and the spec mandates it be done in a bad way. Breaking those down:

* There's no particular reason why you'd expect UIDs to give you monotonic ordering, and no particular use case for it either. Being roughly k-sortable is very useful (eg, to allow efficient DB indexes), but strict monotonicity? There's just very few cases where that's needed, and when it is, you'd probably want a dedicated system to coordinate that. Even worse, if you do need it, ULIDs don't actually guarantee it. Which brings us to:

* UIDs are generally most useful in distributed systems, and the more you scale the more that will become required. The second you have a second system generating UIDs, your monotonicity guarantees go out the window, and without work that the various ULID libraries have not done, that's even true the moment you have a second process generating them. And the ULID spec doesn't even try and work around this (nor do all ULID libraries try to guarantee monoticity even when used in a single threaded manner), which in turn means you have no idea if any given ULID was generated in a way where monoticity is being attempted.

* The actual ULID spec says the UID is broken up into a timestamp and a random component, and if you ever generate a second ULID in the same millisecond, you must increment the random component, and if you can't due to overflow, generation must fail. Which is wild; it means you can only generate a random number of ULIDs per millisecond, and there's a chance you can only generate one. Even worse, if you can ever manage to cause a system to generate a ULID for you in the same millisecond as it generates one for your target, you can trivially guess the other ULID! Many use cases for UIDs assume they're effectively unguessable, but this is not true for compliant ULIDs.

The ULID spec is, a best, a bit underbaked. (And as others have noted, I find that Crockford's base32 encoding is a suboptimal choice.)

tasn 743 days ago [-]

We maintain a couple of popular ksuid libraries[1][2] and use it, so we definitely like ksuid. Though one big issue with ksuid is that being 160bit means that it doesn't fit into native uuid types in databases (e.g. postgres), which means that they come with a performance penalty.

1: https://github.com/svix/rust-ksuid 2: https://github.com/svix/python-ksuid

Xeoncross 743 days ago [-]

I'm curious, why do you not store these as binary data or do you and you're saying that the UUID operations are better optimized than sorts on binary data?

snuxoll 743 days ago [-]

I can compare a 128bit UUID in a single instruction, a 160-bit ksuid is a little weirder to work with at the hardware level.

tasn 743 days ago [-]

Exactly what the sibling said, and the same applies to database operations (when they have a uuid type).

lll-o-lll 743 days ago [-]

K-Sortable is a great concept; having weakly sorted keys solves a bunch of use-cases. I really like the idea of a typed, condensed string representation. However I wonder if an unintended side affect of UUID V7 is going to be a bunch of security problems.

People aren’t meant to use uuids as tokens, and they aren’t supposed to use PKs from a DB for this either - but they do. Because UUID v4 is basically crypto random, I think we’ve been getting away with a bunch of security weaknesses that would otherwise be exploited.

With UUID v7 we are back to 32bits of actually random data. It’s going to require some good educating to teach devs that uuids are guessable.

[edit] Looks like I am off base with the guess-ability of the V7 UUID, as the draft recommends CSPRNG for the random bits, and the amount of entropy is at least 74 bits and it is specifically designed to be “unguessable”. It does say “UUID v4” for anything security related, but perhaps that is simply in regard to the time stamp?

atonse 743 days ago [-]

This brings up an interesting ergonomics problem.

By naming them UUIDv4 and UUIDv7, is it going to be this never ending confusion for people to have to remember which one is good for databases and which one good for one time tokens?

Not sure what the backwards compatible solution here is either.

In elixir the function is UUID.uuid4() to generate a v4 UUID.

So we could theoretically scan code for its use I suppose. But all this increases chances of errors.

hot_gril 742 days ago [-]

> is it going to be this never ending confusion for people to have to remember which one is good for databases and which one good for one time tokens

Yes, because this is what's been happening already with the past versions. It's not just sequential and random, there are also hash-based UUIDs. They shouldn't have sequential (heh) version numbers.

hot_gril 742 days ago [-]

I can see some use cases for it, but every time in the past I've encountered other kinds of partially-sequential UUIDs like v1 or v5, they've been misused. Same with the hash-based ones like v3. v4 is simple and not prone to misuse.

vikeri 743 days ago [-]

Another, less known, useful thing about these IDs is that you can double click on them and the full id will always be selected

Eduard 743 days ago [-]

Also, they are safe to use within filenames and directory names (Filesystem paths) without conversion (at least in today's Filesystem not limited to e.g. 8.3 characters) .

Compare that with otherwise nice ISO 8601 datetime format (e.g. 2023-06-28T21:47:59+00:00): it requires conversion for file systems that don't allow colons and plus signs.

jrockway 743 days ago [-]

This is a setting in your terminal emulator. For me, plain UUIDs are selected just fine when double clicking.

mojuba 743 days ago [-]

There's life outside of the terminal. For example you want to double-click on the part of a URL in your browser.

avarun 743 days ago [-]

It's mentioned in the README. "can be selected for copy-pasting by double-clicking"

hot_gril 743 days ago [-]

I don't understand putting type names into DB row IDs. You're safest using whatever IDs in your DB make it happiest (usually bigserial in Postgres), and the needs might be pretty specific. Whenever you want to log row IDs, you add whatever context is needed, which will probably include things besides the ID either way. When you want to share an identifier with a customer, you use something entirely different.

The UUIDv7 properties are interesting, but it's worth noting that at least one DBMS really doesn't like the K-sortable property: https://cloud.google.com/spanner/docs/schema-design#uuid_pri...

quelltext 742 days ago [-]

Just think about logging polymorphic object IDs. If you see an identifier with a type you can immediately know what table it is in and you can even build super simple tools like browser extension that let you look up objects by IDs as is. You can also just share these IDs as is with folks on Slack to debug etc. and they know exactly what kind of entities you are talking about.

Super useful and I honestly don't wanna go back to a project that doesn't use this pattern.

hot_gril 742 days ago [-]

Those debug IDs (or URLs, etc) are worth having for the reasons you describe, but it doesn't mean you use those as primary keys in the DB. Something only needs to print them, and debug tools need to understand them.

743 days ago [-]

swyx 743 days ago [-]

for those researching this topic, I keep a list of these UUID/GUID implementations!

https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...

crdrost 743 days ago [-]

Thanks for this!

I have one idea which is perhaps nerdy enough to make the list but I've never fully fleshed it out, it's that one can encode the nonnegative integers {0, 1, 2, ...} into the finite bitstrings {0, 1}* in a way which preserves ordering.

So if we use hexits for the encoding the idea would be that 0=0, 1=1, ... E=14, then

    F00 = 15
    F01 = 16
    ...
    F0F = 30
    F100 = 31
    F101 = 32
    ...
    F1FF = 286
    F2000 =

so the format is F, which is the overflow sigil, followed by a recursive representation of the length of the coming string, followed by a string of hexits that long.

What if you need 16 hexits? That's where the recursion comes in,

    F F00 0123456789ABCDEF
     \   \    \
      \   \    \----- 16 hexits
       \   \
        \   \-- the number 15, "there are 15+1 digits to follow"
         \       (consisting of overflow, 0+1 digits to follow, and hex 0)
          \ 
           \--- overflow sigil

Kind of goofy but would allow a bunch of things like "timestamp * 1024 + 10-bit machine ID" etc without worrying about the size of the numbers involved

Ironchefpython 743 days ago [-]

https://datatracker.ietf.org/doc/html/rfc2550

Note that this RFC also supports lexicographic sorting for negative numbers using the 10k complement notation.

CMCDragonkai 743 days ago [-]

We have a uuidv7 implementation that we've been using with rocksdb for over a year https://github.com/matrixai/js-id

dap 743 days ago [-]

It's weird to me that people talk about UUIDv4 as bad for databases because of locality. This behavior can be good or bad, depending on what you're doing. Sure, UUIDv4 is worse if you're using it to locate a bunch of related objects that you want to be on the same database node. (If you're doing that, though, you can often attach some kind of group identifier -- e.g., "user" -- and index that.) I generally prefer UUIDv4 in a lot of distributed database applications because as a sharding key it's likely to distribute data well across all available nodes.

I debugged a system that used the equivalent of time-ordered UUIDs where even though the database was horizontally scaled, the performance was limited by the capacity of one server. It was exactly because the uuids being generated started with similar prefixes, so at any given time, all the current data was going to whichever database node was responsible for that range of the keyspace.

hot_gril 741 days ago [-]

Yep. Spanner docs even tell you not to use time-ordered keys.

kortex 743 days ago [-]

Does the prefix ("user_") get recorded in the DB (so every string in the column starts with the same "user_"), or does are there constraints and other clever chicanery to save those bytes in every record? Or do modern DB engines even care? Is this premature optimization?

carlsverre 743 days ago [-]

The authors have created a specialisation for Postgres that leverages a custom type which is a tuple of type and uuidv7: https://github.com/jetpack-io/typeid-sql/blob/main/sql/typei...

This is more optimal for Postgres while making it slightly more difficult to interop between the db and the language (db driver needs to handle custom types, and you need to inject a custom type converter).

And while there are hacks you can do to make storing uuid-alikes as strings less terrible for db engines, if you want the best performance and smallest space consumption (compressed or not) make sure to use native ID types or convert to BINARY/numeric types.

hot_gril 742 days ago [-]

In my experience, using just uuid as a pkey in Postgres already causes noticeable slowdowns vs the typical bigint. I wouldn't jump into anything other than bigint pkeys unless I'm solving an existing problem and have benchmarks to prove it.

sophiabits 743 days ago [-]

Something really nice about type prefixes in IDs is it makes “Global Object Identification” [1] in GraphQL super straightforward. The node query can simply inspect the ID and immediately know where to fetch the object from, whereas with “regular” IDs you either need to perform a bunch of database queries or otherwise maintain some index that maps IDs->types.

Not implementing this isn’t the end of the world, but it makes refetching individual bits of data in your cache far easier. It’s really nice being able to implement it basically for free.

[1] https://graphql.org/learn/global-object-identification/

jszymborski 743 days ago [-]

This is very similar to how I generate IDs in a project I'm working on.

Example:

    |-A-|-|------------B--------------|
    NMSPC-9TWN1-HR7SV-MTX00-0H8VP-YCCJZ
    A = Namespace, padded to 5 chars. Max 5 chars. Uppercase.
    B = Blake3 hashed microtime with a random key.

I like how it folds in a time component but that it also doesn't reveal the time it was generated.

Here's the snippet: https://gist.github.com/jszym/d3c7907b7b6e916f68205c99e5e489...

goostavos 743 days ago [-]

Namespacing identifiers in general is a great idea for handling those class of integration tests which cannot be fully isolated. It makes it easy to write all kinds of garbage from even concurrently running tests all without any of them colliding or accidentally reading each others writes (because they are themselves namespace aware!). It's low effort to get all the pieces of your system to play along (often entirely transparent via DI), but gives a huge power to weight ratio. Basically deletes an entire class of problems which usually plague large, mature test suits

rowbin 742 days ago [-]

Do you generate a random key for each id? If so, you loose the time component of your I'd, because two hashes made at a different time can produce the same hash when using a different key (I think). If you use a single key you need to manage it and keep it secret which isn't really ideal for something simple like id's. If you reveal the key you could as well use an unkeyed hash, right? Maybe a salted hash with the same salt for every key to make dictionary attacks on the creation time harder. But then again it's probably easier to just generate random bits for the id, as you don't have sortability in any case.

pphysch 743 days ago [-]

I implemented something similar recently, but opted to write my own UUIDv7 that uses the last 16 flexible bits for a "type ID". That allows 65K different data models, which should be more than enough. Could even partition that further to store a node ID for a globally distributed setup.

So it's got all of the above perks, but it also an actual UUID and fits in a Postgres UUID column.

It's very cool to be able to resolve a UUID to a particular database table and record with almost zero performance overhead (cached table lookup + indexed record select).

Daegalus 743 days ago [-]

Ideally, you should version that as UUIDv8. Since that is a more custom implementation, but I guess changing the random bits with type info works fine for UUIDv7, they jsut arent random anymore.

pphysch 743 days ago [-]

Not every bit of UUID is required to be random.

The goals are smallness, uniqueness, monotonicity, resistance to enumeration attacks, etc. Not randomness for randomness sake.

My UUIDv7+ can be consumed as a standard UUIDv7. It is not intended to be v8. A program can treat the last 16 bits as random noise if it wants.

hfkwer 743 days ago [-]

> It is not intended to be v8.

There is already a UUIDv8. It's defined as vendor-specific UUID. https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...

> Some example situations in which UUIDv8 usage could occur:

> An implementation would like to embed extra information within the UUID other than what is defined in this document.

Isn't that exactly what you are doing?

Daegalus 743 days ago [-]

I am aware, just saying per spec, its supposed to be random bit data, thats all I was saying. I am familiar with a spec since I maintain a UUID library that has 6,7, and a custom 8 implemented.

It can have extra monotonicity data instead, per section 6.2 but ideally its random. Again, Not saying you can't do what you are doing, I just know per the conversations while the draft was gathering feedback, your type of change was intended to be done as uuidv8

pphysch 743 days ago [-]

> per spec, its supposed to be random bit data

> It can have extra monotonicity data instead

Well, which is it? These are incompatible requirements.

If I give you a standard UUIDv7 sample, it is impossible for you to interpret the last 62 bits. You cannot determine how they were generated. If I give you two samples with the same timestamp, you cannot say which was generated first. These bits are de facto uninterpretable, unlike e.g. the 48 MSB, which have clearly defined semantics.

Daegalus 743 days ago [-]

Well, that might be an ambiguity that needs to be brought up before its final if it is an issue.

So if we look at https://www.ietf.org/archive/id/draft-ietf-uuidrev-rfc4122bi...

For list item #3 it says "Random data for each new UUIDv7 generated for any remaining space." without the word "optional" and the bit layout diagram says `rand_b`

But when you read the description for `rand_b` it says: "The final 62 bits of pseudo-random data to provide uniqueness as per Section 6.8 and/or an optional counter to guarantee additional monotonicity as per Section 6.2."

Reading section 6.2 https://www.ietf.org/archive/id/draft-ietf-uuidrev-rfc4122bi..., it all involves incrementing counters, or other monotonic random data.

If you can guarantee that you custom uuidv7 is globally unique for 10000 values per second or more, I don't see why you can't do what you do and treat your custom data as random outside of your implementation.

I think part of this is my mistake, because I assumed you replaced most of the random data with information, but reading it now, I read that you replaced just the last 16 bits. Also since most people used random data for UUIDv1's remaining 48bits of `node` then your variation is no worse than UUIDv1 (or 6) while also being compatible with v7.

I think I just got too caught up on the the bit layout calling it `random` and misread your information. Sorry for the misunderstanding, and thanks for discussing it.

jeremyjh 743 days ago [-]

Well UUIDv7 can be consumed as a UUIDv4 in the same way, its just 16 bytes. The point of the standard is to define how the particular bytes are chosen.

pphysch 743 days ago [-]

The latest standard for v7 does not meaningfully describe how to interpret the last segment.

It says they could be pseudorandom and non-monotonic. Or it could be monotonic and non-random. These are completely disjoint cases! "X or not X" is tautological. And there is no way to determine which (e.g. there could be a flag that indicates this mode, but there is not).

To be clear, the standard should be amended to resolve this ambiguity. Say the last bits MAY be monotonic or MAY be pseudorandom. Or add a flag that indicates which.

As there is currently no standard way to interpret these bits, I feel perfectly justified in using the a few of the least significant ones to encode additional information.

jeremyjh 743 days ago [-]

I think the purpose of the standard is so that different software implementations work the same way, so that once you've picked a standard, you can use it everywhere and know that keys are assigned the same way regardless of which software stack is generating a particular key. Its not so that systems can "interpret" it. Obviously they are your bytes to use however you want if you are rolling your own generator.

pphysch 743 days ago [-]

> you can use it everywhere and know that keys are assigned the same way regardless of which software stack is generating a particular key.

But even if you follow the standard to a tee, you cannot infer anything about how the last 62 bits were assigned. That is my point!

dralley 743 days ago [-]

Postgresql doesn't care, it's not going to "interpret" those bits, it is just a 128-bit integer.

pphysch 743 days ago [-]

And I'm glad for it, because I could implement this without needing an extension or update to PostgreSQL.

atulvi 743 days ago [-]

Naive Question: The type safe part is just appending a string at the beginning? What if I do that with UUIDv4? is user_49b9cd12-9964-4b9c-8512-742f0a2c9be4 type safe now?

dloreto 743 days ago [-]

That's how the type is encoded as a string, but type-safety ultimately comes from how the TypeID libraries allow you to validate that the type is correct.

For example, the PostgresSQL implementation of TypeID, would let you use a "domain type" to define a typeid subtype. Thus ensuring that the database itself always checks the validity of the type prefix. An example is here: https://github.com/jetpack-io/typeid-sql/blob/main/example/e...

In go, we're considering it making it easy to define a new Go type, that enforces a particular type prefix. If you can do that, then the Go type system would enforce you are passing the correct type of id.

davidjfelix 743 days ago [-]

Yep. The whole point is that you /never/ assign ids that begin with "user" to types that are not users. Because of that, you can be sure nobody can accidentally copy an id that begins with "user" when meaning to address a different type and get back a result other than "not found".

Example:

I have userId=4 and userId=2. Suppose a user can have multiple bank accounts and userId=4 has accountId=5 and accountId=6 and a defaultAccound accountId=5. userId=2 has an account, accountId=7; I want to send userId=4 some money so I use the function `sendUserMoneyFromAccount(to: int, from: int)`. This is a bad interface but these things exist in the wild a lot. I could accidentally assume that because I want to send userId=4 the money to their default account that I would call it using `sendUserMoneyFromAccount(4, 7)` and that would work, but if under the hood it wants 2 accountIds, I've just sent accountId=4 money rather than userId=4's defaultAccount, accountId=5.

With prefixed ids that indicate type, a function that assumes type differently from the one supplied will not accidentally succeed.

In addition, humans who copy ids will be less likely to mistake them. This is just an ergonomic/human centric typing.

darajava 743 days ago [-]

In the readme, Crockford’s alphabet is referenced [0]. In this specification, “U” is excluded because it is an “accidental obscenity”. Does anyone have any idea what that means? Is it a joke?

0. https://www.crockford.com/base32.html

autoexec 743 days ago [-]

I'm guessing that it's because U just happens to appear in some popular four letter words, which might make things awkward when you're reading letters out to another person over the phone. It might also come before a lot of other rude letter combinations making them seem more personal.

743 days ago [-]

tlrobinson 743 days ago [-]

I really like prefixed identifiers and strongly recommend using them from the beginning.

One question: I’ve seen people use both integer IDs (as the primary key) and prefixed GUIDs (for APIs) on the same table. This seems confusing and wasteful. Is there a valid reason to do that, and is it common? Performance when doing joins or something?

EDIT: it sounds like that’s the reason for them to be “K-Sortable”:

> TypeIDs are K-sortable and can be used as the primary key in a database while ensuring good locality. Compare to entirely random global ids, like UUIDv4, that generally suffer from poor database locality.

beyonddream 743 days ago [-]

This looks really interesting! Tangentially related if anyone is interested, I recently wrote[1] a pure C (no external dependency) version of coordination free, k-ordered 128-bit UUID Generator library which is inspired by snowflake but has bigger key space and few nifty features to protect against clock skew etc. The 128 bits are split into

  {timestamp:64, worker_id:48, seq: 16}

where the seq is incremented if the unique id is requested within the same millisecond.

[1] - https://github.com/beyonddream/snowid

VanillaCafe 743 days ago [-]

A pet nit, and the standards probably don't permit this, but for encoding 128-bit numbers, I prefer base-57 in my own implementations. 22 characters for a 128-bit encoding, same as base-64. You can split it into two 11-character 64-bit encodings. You can avoid the two non-alphanumeric characters in base-64 as well as the similar-looking characters like l1 and oO0. And it takes less visible space, so a bit easier for debugging and tabular output with otherwise no loss of generality.

timf 743 days ago [-]

I do a similar thing [1]. One of the great advantages to formally namespaced IDs is including a systematic conversion into strong types in your code. It's harder to accidentally mix things up when coding; function parameters and return tuples are more 'self documented' (and enforced by compiler where applicable).

[1] - https://www.peakscale.com/strongly-typed-ids/

koito17 743 days ago [-]

How does this compare to a SQUUID for sorting or nano-id for human readability? Both are options I've used in the past when using databases like Datomic or XTDB. SQUUIDs in particular because I have a UUID that can be ordered by timestamp, nano-id when prototyping things and I want meaningful prefixes in my entity IDs rather than a bunch of UUIDs.

BugsJustFindMe 743 days ago [-]

Skip the base32 because it's cargo culting and go with a higher base. The premise of crockford's chosen exclusions is fundamentally flawed. In arbitrary fonts, symbol collisions are arbitrary. There are plenty of fonts where any of 5 and S, 2 and Z, V and U, G and 6, 8 and B, and 9 and g are confusable. Likewise vv and w, nn and m.

ajanuary 743 days ago [-]

Neat, I like the type safe prefix idea.

Personally, I rarely find I need ids to be sortable, so I just go with pure randomness.

I also like to split out a part of the random section into a tag that is easier for me to visually scan and match up ids in logs etc.

I call my ID format afids [0]

[0] https://github.com/aJanuary/afid

nhumrich 743 days ago [-]

I feel validated. I wrote about this very problem before and came up with something similar https://dev.to/nhumrich/why-i-dont-like-uuids-5d9n

jtmarmon 743 days ago [-]

This looks great! Is there a reason one couldn't use this with v4 UUIDs? A quick test shows that they encode/decode just fine. Wondering if I could use the encoded form as a way to niceify our URLs without having to change how the IDs (currently v4 uuids) are stored

dloreto 743 days ago [-]

The CLI tool will support encoding/decoding any valid UUID, whether v1, v4, or v7. We picked v7 as the definition of the spec, because we need to choose one of them when generating a new random ID, and our opinion is that by default, that should be v7.

We might add a warning in the future if you decode/encode something that is not v7, but if it suits your use-case to encode UUIDv4 in this way, go for it. Just keep in mind that you'll lose the locality property.

BiteCode_dev 743 days ago [-]

Nice. I should use a similar id with ULID.

Although, if stored in a DB, I would probably split the tag and the id in 2 separate columns, because DBMS often have a dedicated efficient UUID field type.

A little client code wrapper can make that transparent.

wg0 743 days ago [-]

Can anyone guide me about the pros and cons of xid, ksuid and this type-safe option?

matthewfcarlson 743 days ago [-]

I actually implemented something similar for a personal project with a different syntax. It was two characters for the type, a colon, and then a UUID. Made it really easy to tell what primary key it corresponded to.

ongteckwux 743 days ago [-]

Did a Typescript port. Thanks for creating this! Very useful. https://github.com/ongteckwu/typeid-ts

743 days ago [-]

AtNightWeCode 743 days ago [-]

An important aspect of identifiers is to not leak any information in the identifier. In some scenarios a prefix might be fine but less important things have been blocked by our dpo department.

clintonb 743 days ago [-]

Requirements depend on the use case. I don’t consider the prefix a “leak” and neither does Stripe.

AtNightWeCode 743 days ago [-]

Yeah and Twitter IDs used to be 32-bit integers. In this case it is kind of obvious that the IDs leaks information about internal data types. Which is not ok at many companies.

jeremyjh 743 days ago [-]

Yeah, this idea makes me uneasy as well for the same reason. The natural conclusion would be to have DB frameworks automatically do this based upon code type names and that definitely feels like a leak, handing the world at large a roadmap to your internal system architecture.

changoplatanero 743 days ago [-]

what does k-sortable mean?

tomnipotent 743 days ago [-]

That if a hundred servers are generating (uuid, timestamp) tuples that are subsequently merged on a single machine, and sorted by uuid, it would have almost the same order as if sorted by timestamp.

This property is useful for RDBMS writes, when the UUID is used as a primary key and this locality ensures that fewer slotted pages need to be modified to write the same amount of data.

nzgrover 743 days ago [-]

Is that what they mean by "used as the primary key in a database while ensuring good locality"/"database locality"? That read/write access will hit fewer disk pages?

netcraft 743 days ago [-]

yes, exactly

nickjj 743 days ago [-]

> it would have almost the same order as if sorted by timestamp.

Is there documentation covering the scenarios on how the order can become out of sync and what the odds are? There's a big difference between "almost" and "always" if we're talking about using this as a database PK.

tomnipotent 743 days ago [-]

> There's a big difference between "almost" and "always"

Not in the context of an RDBMS, which use b+/b*-tree variants (or LSM sstables). Sequentially generated UUID's will end up near each other when sorted lexicographically, regardless of the fact that the sort order doesn't perfectly match the timestamp order.

michaelt 743 days ago [-]

The key seems to be based on UUIDv7, starting with a timestamp in milliseconds.

So the order can become out of sync if multiple events happen in the same millisecond; or if your servers' clock error is greater than a millisecond (i.e. if you're an NTP user)

More than sufficient for things like ordering tweets. If you're ordering bank account transactions, well, you'd probably be using transactions in an ACID-compliant relational database.

AtNightWeCode 743 days ago [-]

That it is a nearly sorted sequence. Typically the first part of the ID is a timestamp. So you may sort the IDs down to the second it was created. But for IDs created at the same time the order is random. This can be used for caching, database performance and so on.

sixtram 743 days ago [-]

[dead]

iillexial 743 days ago [-]

I didn't get the "type-safe" part. How would it work in Go?

Let's say I have structs:

type User struct { ID TypeID }

type Post struct { ID TypeID }

How can I ensure the correct type is used in each of the structs?

zeroxfe 743 days ago [-]

It's not a language primitive. It's a data format that enables type safety in libraries or APIs (as opposed to a more opaque data format like UUIDv7.)

kibwen 743 days ago [-]

Any time you ever read a string, its type is always just going to be "string" (modulo whatever passes for a "string" in your programming language of choice). To get an actual non-string type, you'd need to parse that string, and presumably your parsing function would read the prefix and reject the string if it was passed an ID whose type doesn't match. So it's dynamically type-safe, if not statically type-safe.

Merad 743 days ago [-]

I don't know go, but in C# I'd probably do something like the code below. The object really only needs to carry the uuid/guid, let the language type system worry about the difference between a user and post id. We just need a generic mechanism to ensure that UserId object can only be constructed from a valid type id string with type = user. For production use you'd obviously need more methods to construct it from a database tuple (mentioned elsewhere in the comments), etc.

    interface ITypeIdPrefix
    {
        static abstract string Prefix { get; }
    }

    abstract class TypeId<T>
        where T : TypeId<T>, ITypeIdPrefix, new()
    {
        public Guid Id { get; private init; }

        public override string ToString() => $"{T.Prefix}_{Id.ToBase32String()}";
        // Override GetHashcode(), Equals(), etc.

        public static bool TryParse(string s, out T? result)
        {
            if (!s.StartsWith(T.Prefix) || !TrySplitStringAndParseBase32ToGuid(s, out var id))
            {
                result = default;
                return false;
            }

            result = new T { Id = id };
            return true;
        }
    }

    class UserId : TypeId<UserId>, ITypeIdPrefix
    {
        public static string Prefix => "user";
    }

    class PostId : TypeId<PostId>, ITypeIdPrefix
    {
        public static string Prefix => "post";
    }

leetbulb 743 days ago [-]

One way is to enforce in Marshal[0] and Unmarshal[1]

[0] https://pkg.go.dev/encoding/json#Marshaler

[1] https://pkg.go.dev/encoding/json#Unmarshaler

avgcorrection 743 days ago [-]

It’s stringly-typed type-safety: check if the value has the expected prefix.

hfkwer 743 days ago [-]

This isn't about object types in any particular language.

quelltext 742 days ago [-]

Wouldn't it make sense to allow encoding additional data in the suffix? For instance a sharding key (or whatever you wanna call it)?

jhoechtl 743 days ago [-]

I am no big fan of the semantic web but the prefix should be an URI to be able to segment the domain. A single prefix is IMHO not enough.

onlypositive 743 days ago [-]

> Compare to entirely random global ids, like UUIDv4, that generally suffer from poor database locality.

What does this mean in more words?

armchairhacker 743 days ago [-]

Because the bytes are all random, UUIDv4 is sorted randomly. So whenever you insert a database entry with a new UUID, it ends up getting put in some random memory location.

In practice, you often want to select database entries which were inserted near each other in time. Ex: you would like to select the most recent entries or entries within a time frame. Even when selecting entries by other information, entries inserted closer to each other in time are generally more likely to be related.

Fetching entries near each other in memory is faster, so it would be nice to insert entries sequentially in time; we want entries inserted near each other in time to be located near each other in memory.

This is what counter-based indexing does: the database has a counter which increments on each insertion, and the current value becomes the inserted entry's id. But the problem with counters is when the database is distributed and insertions are happening in parallel, and you definitely don't want to sync the counter because that's way too slow.

UUIDv7 combines a sort of counter (Unix time) with a randomly-generated number. The counter bytes are first, so they determine the sort order; but in case 2 entries get inserted at the same time, the randomly-generated number keeps them distinct and totally ordered.

hot_gril 742 days ago [-]

Right, but worth noting that distributed DBs don't necessarily play nice with sequential pkeys. Spanner explicitly tells you not to use a time-based pkey.

ukuina 743 days ago [-]

Why is it beneficial to sort random IDs?

davidjfelix 743 days ago [-]

So you don't have to create an additional index to scan them in a somewhat sensible order. createdAt just happens to be a naturally decent scan order.

743 days ago [-]

TeeWEE 743 days ago [-]

Good, but I dont see a big advantage over UUIDv7 Anyone has some good ones?

dloreto 743 days ago [-]

It's based on UUIDv7 (in fact, a TypeID can be decoded into an UUIDv7). The main reasons to use TypeID over "raw" UUIDv7 are: 1) For the type safety, and 2) for the more compact string encoding.

If you don't need either of those, then UUIDv7 is the right choice.

Loading comments...

klabb3 743 days ago [-]

A couple of suggestions:

The Go implementation has no tests. This is very unit-testable. Add tests goddammit!

As for the actual design decisions, I tried to poke holes but I fold! I think this strikes the sweet spot between the different tradeoffs. Well done!

dloreto 743 days ago [-]

Thanks for the feedback!

Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

dolmen 743 days ago [-]

There is no tests.

There is just a single test. Which only tests the decoding of a single known value. No encoding test.

Go has infrastructure for benchmarking and fuzzing. Use it!

dloreto 742 days ago [-]

We've now implemented pretty thorough testing: https://github.com/jetpack-io/typeid-go/blob/main/typeid_tes...

Thanks for the feedback!

klabb3 743 days ago [-]

> We have tests for the base32 encoding which is the most complicated part of the implementation

> Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

aftbit 743 days ago [-]

It was chosen to avoid a number of the most annoying ambiguous letter shapes for hand-entry of long address strings.

https://en.bitcoin.it/wiki/Base58Check_encoding

bredren 743 days ago [-]

Reminds me that Windows activation keys used to exclude a broader set of characters to avoid transcription errors: looking it up again: 0OI1 and 5AELNSUZ

unshavedyak 743 days ago [-]

Is there a good reason why Base63 (nopad) doesn't exist? Ie Base64 minus the `-`, so that you almost get the density of base64 (nopad) but the double click friendly feature.

I thought maybe it was performance oriented. An odd prefix length like base63 would mean .. i think, a slightly more computationally demanding set of encoding instructions?

Either way i basically want base58 but i don't care about legibility, i just wanted double click and url friendly characters.

fauigerzigerk 743 days ago [-]

>Is there a good reason why Base63 (nopad) doesn't exist? Ie Base64 minus the `-`, so that you almost get the density of base64 (nopad) but the double click friendly feature.

Base32 can represent 5 bits per character because log2(32) == 5. Anything in between 32 and 64 doesn't buy you anything because there is no integer between 5 and 6.

unshavedyak 742 days ago [-]

Is that "just" a performance concern though? Ie why is there a base58 and base62 but no base63?

Now you've got me curious on the performance of base58 to base64 hah. Down the rabbit hole i go. Appreciate your reply, thanks :)

TedDoesntTalk 743 days ago [-]

It’s not too difficult to write your own encoding. Probably 10 lines of code or less if you hard-code your encoding alphabet.

kbumsik 742 days ago [-]

What does “double clickable” mean?

d0mine 742 days ago [-]

Whether "double click" selects the whole id.

kbumsik 743 days ago [-]

> Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

It would be great if you add suggestions for compound types (like “article-comment”) in README as OP stated as well.

spoiler 743 days ago [-]

But also, I'm bike-shedding and its only an ID

kbumsik 742 days ago [-]

I like using "." for this case. Because types definition typically belong to a package or module, which commonly uses "." for separator.

avgcorrection 743 days ago [-]

> The Go implementation has no tests. This is very unit-testable. Add tests goddammit!

Yep. The readme asks people to provide other implementations. Having a test suite would be good for third-party code.

dloreto 742 days ago [-]

A follow up:

1. We've now implemented pretty thorough testing: https://github.com/jetpack-io/typeid-go/blob/main/typeid_tes...

2. I clarified the prefix in the spec

Thanks for the feedback!

rtheunissen 743 days ago [-]

I'll write some tests for this.

jhoechtl 743 days ago [-]

I think your misconception is that prefix is sthg. fixed? You decided on the prefix depending on the usage domain.

tomcam 743 days ago [-]

> Add tests goddammit!

Hey, you’re pretty smart. How about you add them?

klabb3 743 days ago [-]

I’m by no means a test police. I’m in fact opposed to a lot of mindless testing for the sake of it. But there are places where unit tests shine, and this is one of them.

tomcam 742 days ago [-]

Do you need to be “test police“ to add tests you literally demanded of them?

aartav 743 days ago [-]

I've been doing this kind of thing for years with two notable differences:

sokoloff 743 days ago [-]

> base 32 without `eiou` (vowels) to reduce the likelihood of words (profanity) sneaking in.

We had “analrita” as an autogenerated password that resulted in a complaint many years ago. Might consider adding ‘a’ as an excluded letter.

michaelt 743 days ago [-]

Presumably base 32 means 26 letters + 10 digits - 4 banned letters

So adding an excluded letter is not easy.

sokoloff 743 days ago [-]

Why not use base-31 and (optionally) more characters? (Or go upper and lower or add a symbol if you had to stay with a fixed-size and base-32 for some reason)

chipsa 743 days ago [-]

Because Base32 is just bit shifting and then converting the 5 bits into a char, and vice versa. Doing Base31 requires base conversion.

manquer 743 days ago [-]

Wouldn’t that be excluded because i is already removed ?

jhgg 743 days ago [-]

I don't think they took offense to the "rita" part.

bombcar 743 days ago [-]

analratassart would probably be just as bad

EGreg 743 days ago [-]

ianal

spoiler 743 days ago [-]

rejiswe 742 days ago [-]

[flagged]

743 days ago [-]

tlrobinson 743 days ago [-]

I agree with the addition of the checksum, however I’m curious:

> either by accident or malice

2. How does a 2 byte (non-cryptographically secure) checksum help in the case of malice?

dloreto 743 days ago [-]

The checksum idea is interesting. I'm considering whether it makes sense to add it as part of the TypeID spec.

veec_cas_tant 743 days ago [-]

What value does the checksum provide? I think I'm missing something because I really don't see a benefit.

diroussel 739 days ago [-]

The benefit is that you can reject bad requests to an API more easily.

So in the age of public internet faceing APIs and app urls, I think built in optional check digit support is a good idea.

veec_cas_tant 739 days ago [-]

I struggle to see how 10 bits of check data will help much. I guess if the extra bits aren’t persisted to storage it doesn’t hurt so why not?

diroussel 738 days ago [-]

bumbledraven 743 days ago [-]

Checksums facilitate error detection. For typed UUIDs, checksums help detect errors introduced by changing the prefix/type or changing a “digit”.

zrail 743 days ago [-]

ajkjk 743 days ago [-]

kibwen 743 days ago [-]

> The page says the reason for excluding U is "accidental obscenity'.

pluijzer 743 days ago [-]

You could argue that U can be confused with V.

mtlmtlmtlmtl 743 days ago [-]

A vaguely related historical tangent is that V and U used to be just two ways of writing the same letter in Early Modern English. Which I imagine is why W is named as "double U" in speaking.

jabbany 743 days ago [-]

This is also interesting since in French (and I think Spanish?) W is (correctly) called "double V"

mtlmtlmtlmtl 743 days ago [-]

Same thing in Norwegian. But W is also not really a part of Norwegian orthography, it's just kind of there in the alphabet anyway. Only useful for names and maybe a couple loanwords.

namtab00 743 days ago [-]

In Romanian and Italian too.

TRiG_Ireland 743 days ago [-]

That is indeed why w is named "double u" in English (and "double v" in French. And why it's a vowel in Welsh.

See more from jan Misali: https://www.youtube.com/watch?v=sg2j7mZ9-2Y

EGreg 743 days ago [-]

Same with B and V, P and F, SS and Z, etc.

It is why PH is F, for instance.

dmurray 743 days ago [-]

5 and S seems more likely.

quickthrower2 743 days ago [-]

Fvck!

743 days ago [-]

dolmen 743 days ago [-]

This assumes that english is the only relevant language regarding curse words. Which is quite biased.

quickthrower2 743 days ago [-]

U is a fairly new letter anyway.

codeulike 743 days ago [-]

titanomachy 743 days ago [-]

Wow I didn't know HN even had obscenity filters, and I've been here for many years.

Guess that's a credit to the general civility of the community.

taberiand 743 days ago [-]

No obscenity filters, but there is a pretty good password filter I hear. For example, my password 'hunter2' will be all **** to you

9dev 743 days ago [-]

Isn’t it nice how some traditions do stick around. It’s been a while, Cthon98!

redler 743 days ago [-]

Sadly, this is one of those things that, soon, if not now, only the olds will know about.

astrospective 743 days ago [-]

Like telling people asking for game cheats to use Alt-F4.

rbera 743 days ago [-]

macintux 743 days ago [-]

I assumed this was a riff on the classic bash.org transcript.

http://www.bash.org/?244321

Aeolun 743 days ago [-]

But we know there are no server side filters here. At least I never had any profanity censored by anything other than the Mk 1 brain.

743 days ago [-]

AceJohnny2 743 days ago [-]

you accidentally the whole thing

EGreg 743 days ago [-]

I think the point

pavlov 743 days ago [-]

True Latinists find the letter U vulgar to the point of obscenity because it didn’t exist in Cicero’s time.

oleganza 743 days ago [-]

Trve Latinists wovld appreciate yovr point.

littlestymaar 743 days ago [-]

Gotcha, there was no “W” in the Latin alphabet either ;)

Zamicol 743 days ago [-]

There's more!

- base 58 - Satoshi's/Bitcoin's https://en.wikipedia.org/wiki/Binary-to-text_encoding#Base58

- "base62" - Keybase's saltpack https://github.com/keybase/saltpack

- The famous "Adobe 85" - https://en.wikipedia.org/wiki/Ascii85

- basE91 - https://base91.sourceforge.net

At work we defined several new "bases" for QR code. IMHO, it is an under applied area of computer science.

hinkley 743 days ago [-]

tggr is just cute, n**r is an uncomfortable conversation with multiple HR teams (we were B2B)

hinkley 743 days ago [-]

Y is pretty versatile for pissing people off.

theptip 743 days ago [-]

The Scunthorpe problem?

https://en.m.wikipedia.org/wiki/Scunthorpe_problem

pizzapill 743 days ago [-]

E-Mail accounts seem the worst. Just lets write letters again, if you need a pencil I recommend penisland.net

TedDoesntTalk 743 days ago [-]

Penis mean “tail” in Latin.

pizzapill 743 days ago [-]

As they say Your pen is Our business!

programmarchy 743 days ago [-]

There's enough comedic content in this article for several Silicon Valley episodes.

jszymborski 743 days ago [-]

FUCK

Racing0461 743 days ago [-]

yep, youtube video ids has/had? same issue where it would have things like fag/f4g etc in it.

eg: google "allinurl:fag site:youtube.com"

stronglikedan 743 days ago [-]

MR4D 743 days ago [-]

Obscenities change with language. That’s why every language has them.

Even programming language have them. For instance, Basic has GOTO.

ZeroClickOk 743 days ago [-]

and javascript has type coercion

arcticbull 743 days ago [-]

[1] https://newsfeed.time.com/2013/12/17/delta-airlines-is-very-...

kortex 743 days ago [-]

https://www.dailymail.co.uk/news/article-2039662/Virginia-dr...

taosx 743 days ago [-]

Why we care about obscenity in pseudo-random ids and url?

whimsicalism 743 days ago [-]

anglo morals

Racing0461 743 days ago [-]

i don't, the people that can start a **storm on twitter/tumblr causing your stock to drop 10% does.

oleganza 743 days ago [-]

nephanth 743 days ago [-]

Are those supposed to be user- facing though?

lolinder 743 days ago [-]

Using Stripe-style prefixing is most useful if they are—it makes interacting with customer support easier because everyone is on the same page about what this ID identifies.

avgcorrection 743 days ago [-]

> The page says the reason for excluding U is "accidental obscenity'. What the heck is it talking about?

Because he’s an American?

deanmen 743 days ago [-]

The F word has a U in it. Sure you could just say FVCK

bongobingo1 743 days ago [-]

Or Fwck if you doubly mean it.

743 days ago [-]

programmarchy 743 days ago [-]

Yeah, wtf?

atonse 743 days ago [-]

Obviously by “wtf” you must mean “why the face?” Right? Right?? :-)

inopinatus 743 days ago [-]

Edit to add: to be fair to Douglas Crockford, his encoding of base 32 was designed two decades ago, when the usage landscape looked quite different.

dloreto 743 days ago [-]

I hear you ... and I debated using either base58 or base64url. I do like the more compact encoding they provide.

742 days ago [-]

ash 743 days ago [-]

I agree that pronouncing identifiers over the phone is rare. But I’m occasionally typing identifiers from:

1. a screenshot or a screen share that contains an identifier

2. another device where I can’t easily take an identifier

inopinatus 743 days ago [-]

Lazare 743 days ago [-]

I agree; base58 or base62 (which KSUIDs use) have a lot to recommend them. Crockford's base32 works, but I don't love it.

My first choice would be to just use type-prefixed KSUIDs, which gives you 160-bit K-sortable IDs with base62 encoding, which works great unless you need 128-bit IDs for compatability reasons.

yencabulator 742 days ago [-]

Wait, where's the hyphen in Crockford Base32? https://en.wikipedia.org/wiki/Base32#Crockford's_Base32

My favorite base-32 encoding is z-base-32, which I find just gentler on the eyes: https://philzimmermann.com/docs/human-oriented-base-32-encod...

The biggest problems with base58 are 1) it works for integers, less so for arbitrary binary data like crypto keys 2) case-sensitivity ISnOtNIcEtOLoOKaT (in my opinion).

inopinatus 742 days ago [-]

The specification of crockford32 is at https://www.crockford.com/base32.html

yencabulator 742 days ago [-]

Ah, "Hyphens (-) can be inserted into symbol strings." If you don't like 'em, don't use 'em.

742 days ago [-]

stephen 743 days ago [-]

Neat! Love the "type-safe" prefix; we'd called them "tagged ids" in our ORM that auto-prefixes the otherwise-ints-in-the-db with similar per-entity tags:

https://joist-orm.io/docs/advanced/tagged-ids

We'd used `:` as our delimiter, but kinda regretting not using `_` because of the "double-click to copy/paste" aspect...

wongarsu 743 days ago [-]

stephen 743 days ago [-]

Nice!

The `t<X>` makes sense; we currently guess a tag name of "FooBarZaz" --> "fbz", but allow the user to override it, so you could hand-assign "t1", "t2", etc. as you added entities to the system.

hamburglar 743 days ago [-]

wood_spirit 743 days ago [-]

UUIDv7 has been taking HN by storm for years now! When is it going to become a proper standard, and when are libraries and databases and all the rest going to natively support it?

vbezhenar 743 days ago [-]

dolmen 743 days ago [-]

IDs generation is usually private to a company scope and rarely need to be "universally unique".

kijeda 743 days ago [-]

It would appear to be in the final stages of standardization in the IETF: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122b...

Daegalus 743 days ago [-]

Nelkins 743 days ago [-]

How much does it change between drafts? Close enough to where I could use it in production?

Daegalus 743 days ago [-]

Seeing as how its nearly done, it doesn't change much. It changed more often in the beginning, but its like on its final draft, or near final draft. I think the IETF plans to make final soon.

eezing 743 days ago [-]

“…can be selected for copy-pasting by double-clicking”

Details matter.

bombela 743 days ago [-]

And in my experience most people somehow think a UUID must be stored into the human friendly hex representation, dashes included. Wasting so much space in database, network, memory.

rjh29 743 days ago [-]

Many people had the same idea. For example ULID https://github.com/ulid/spec is more compact and stores the time so it is lexically ordered.

jerf 743 days ago [-]

While this isn't the worst area I see this in, there does seem to be a tendency in the UUID space to speak as if one use case stands for all and therefore there is a best UUID format.

The reality is that it is just like any other engineering situation. Sit down, write down your requirements, and see what, if anything, solves it.

If you have a use case where there's an existing solution then, hey, great, go ahead and use it. Maybe if anyone ever needs that but in another language they can pull a library there too.

stronglikedan 743 days ago [-]

> combining time + random number

You can't guarantee that this will be globally unique.

ceejayoz 743 days ago [-]

No identifier can guarantee that. We just get close enough to be acceptable.

Per Wikipedia, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.

so-youre-saying-theres-a-chance.gif

jandrewrogers 743 days ago [-]

I have single datasets with trillions of UUID. Collision probability becomes a thing.

vore 743 days ago [-]

A 1 in a billion collision probability?

743 days ago [-]

duped 743 days ago [-]

A billion is not that big of a number for UUIDs

ceejayoz 743 days ago [-]

Re-read. You'd have to generate 103 trillion to have a one billionth chance of a collision.

A billion isn't that big a number, but 103 trillion is.

jandrewrogers 743 days ago [-]

I think you made a mistake in your math. The Birthday Collision probability of just a trillion random UUID is much higher than that.

ceejayoz 743 days ago [-]

deathanatos 743 days ago [-]

Wikipedia is correct, AFAICT.

The probability of 1 trillion UUIDs having a collision is,

  def birthday_collision(n, m):
      return 1 - math.e ** (-((n -1) * n) / (2 * m))

  In : birthday_collision(1_000_000_000_000, 2 ** 122)
  Out: 9.403589018575076e-14

That number is roughly the approximation given in Wikipedia.

I.e., at 1T UUIDs, it hasn't happened. For comparison, the odds of being struck by lighting (over a lifetime) is many orders of magnitude greater:

  6.535947712418301e-05

hot_gril 741 days ago [-]

The only worthwhile UUID standard IMO is v4 (simple random), and I still don't get why it needs dashes. The other ones don't really accomplish anything.

deepsun 743 days ago [-]

The worst thing about dashes is you cannot easily double-click it whole to copy-paste.

Xeoncross 743 days ago [-]

ksuid is the best general purpose id generator with sort-able timestamps I've found and has libraries in most languages. UUID v1-7 are wasteful.

sophiabits 743 days ago [-]

I have a thin JavaScript package published which generates type-prefixed KSUIDs. KSUID is a great format assuming you can spare the extra bits (which is most people)

[1] https://github.com/sophiabits/resource-id

rtheunissen 743 days ago [-]

I moved to ULID because they are always lowercase and therefore case-insensitive.

Lazare 743 days ago [-]

ULIDs work, but:

The ULID spec is, a best, a bit underbaked. (And as others have noted, I find that Crockford's base32 encoding is a suboptimal choice.)

tasn 743 days ago [-]

1: https://github.com/svix/rust-ksuid 2: https://github.com/svix/python-ksuid

Xeoncross 743 days ago [-]

I'm curious, why do you not store these as binary data or do you and you're saying that the UUID operations are better optimized than sorts on binary data?

snuxoll 743 days ago [-]

I can compare a 128bit UUID in a single instruction, a 160-bit ksuid is a little weirder to work with at the hardware level.

tasn 743 days ago [-]

Exactly what the sibling said, and the same applies to database operations (when they have a uuid type).

lll-o-lll 743 days ago [-]

With UUID v7 we are back to 32bits of actually random data. It’s going to require some good educating to teach devs that uuids are guessable.

atonse 743 days ago [-]

This brings up an interesting ergonomics problem.

By naming them UUIDv4 and UUIDv7, is it going to be this never ending confusion for people to have to remember which one is good for databases and which one good for one time tokens?

Not sure what the backwards compatible solution here is either.

In elixir the function is UUID.uuid4() to generate a v4 UUID.

So we could theoretically scan code for its use I suppose. But all this increases chances of errors.

hot_gril 742 days ago [-]

> is it going to be this never ending confusion for people to have to remember which one is good for databases and which one good for one time tokens

Yes, because this is what's been happening already with the past versions. It's not just sequential and random, there are also hash-based UUIDs. They shouldn't have sequential (heh) version numbers.

hot_gril 742 days ago [-]

vikeri 743 days ago [-]

Another, less known, useful thing about these IDs is that you can double click on them and the full id will always be selected

Eduard 743 days ago [-]

Also, they are safe to use within filenames and directory names (Filesystem paths) without conversion (at least in today's Filesystem not limited to e.g. 8.3 characters) .

Compare that with otherwise nice ISO 8601 datetime format (e.g. 2023-06-28T21:47:59+00:00): it requires conversion for file systems that don't allow colons and plus signs.

jrockway 743 days ago [-]

This is a setting in your terminal emulator. For me, plain UUIDs are selected just fine when double clicking.

mojuba 743 days ago [-]

There's life outside of the terminal. For example you want to double-click on the part of a URL in your browser.

avarun 743 days ago [-]

It's mentioned in the README. "can be selected for copy-pasting by double-clicking"

hot_gril 743 days ago [-]

The UUIDv7 properties are interesting, but it's worth noting that at least one DBMS really doesn't like the K-sortable property: https://cloud.google.com/spanner/docs/schema-design#uuid_pri...

quelltext 742 days ago [-]

Super useful and I honestly don't wanna go back to a project that doesn't use this pattern.

hot_gril 742 days ago [-]

743 days ago [-]

swyx 743 days ago [-]

for those researching this topic, I keep a list of these UUID/GUID implementations!

https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...

crdrost 743 days ago [-]

Thanks for this!

So if we use hexits for the encoding the idea would be that 0=0, 1=1, ... E=14, then

    F00 = 15
    F01 = 16
    ...
    F0F = 30
    F100 = 31
    F101 = 32
    ...
    F1FF = 286
    F2000 =

so the format is F, which is the overflow sigil, followed by a recursive representation of the length of the coming string, followed by a string of hexits that long.

What if you need 16 hexits? That's where the recursion comes in,

    F F00 0123456789ABCDEF
     \   \    \
      \   \    \----- 16 hexits
       \   \
        \   \-- the number 15, "there are 15+1 digits to follow"
         \       (consisting of overflow, 0+1 digits to follow, and hex 0)
          \ 
           \--- overflow sigil

Kind of goofy but would allow a bunch of things like "timestamp * 1024 + 10-bit machine ID" etc without worrying about the size of the numbers involved

Ironchefpython 743 days ago [-]

https://datatracker.ietf.org/doc/html/rfc2550

Note that this RFC also supports lexicographic sorting for negative numbers using the 10k complement notation.

CMCDragonkai 743 days ago [-]

We have a uuidv7 implementation that we've been using with rocksdb for over a year https://github.com/matrixai/js-id

dap 743 days ago [-]

hot_gril 741 days ago [-]

Yep. Spanner docs even tell you not to use time-ordered keys.

kortex 743 days ago [-]

carlsverre 743 days ago [-]

The authors have created a specialisation for Postgres that leverages a custom type which is a tuple of type and uuidv7: https://github.com/jetpack-io/typeid-sql/blob/main/sql/typei...

hot_gril 742 days ago [-]

sophiabits 743 days ago [-]

Not implementing this isn’t the end of the world, but it makes refetching individual bits of data in your cache far easier. It’s really nice being able to implement it basically for free.

[1] https://graphql.org/learn/global-object-identification/

jszymborski 743 days ago [-]

This is very similar to how I generate IDs in a project I'm working on.

Example:

    |-A-|-|------------B--------------|
    NMSPC-9TWN1-HR7SV-MTX00-0H8VP-YCCJZ
    A = Namespace, padded to 5 chars. Max 5 chars. Uppercase.
    B = Blake3 hashed microtime with a random key.

I like how it folds in a time component but that it also doesn't reveal the time it was generated.

Here's the snippet: https://gist.github.com/jszym/d3c7907b7b6e916f68205c99e5e489...

goostavos 743 days ago [-]

rowbin 742 days ago [-]

pphysch 743 days ago [-]

So it's got all of the above perks, but it also an actual UUID and fits in a Postgres UUID column.

It's very cool to be able to resolve a UUID to a particular database table and record with almost zero performance overhead (cached table lookup + indexed record select).

Daegalus 743 days ago [-]

Ideally, you should version that as UUIDv8. Since that is a more custom implementation, but I guess changing the random bits with type info works fine for UUIDv7, they jsut arent random anymore.

pphysch 743 days ago [-]

Not every bit of UUID is required to be random.

The goals are smallness, uniqueness, monotonicity, resistance to enumeration attacks, etc. Not randomness for randomness sake.

My UUIDv7+ can be consumed as a standard UUIDv7. It is not intended to be v8. A program can treat the last 16 bits as random noise if it wants.

hfkwer 743 days ago [-]

> It is not intended to be v8.

There is already a UUIDv8. It's defined as vendor-specific UUID. https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...

> Some example situations in which UUIDv8 usage could occur:

> An implementation would like to embed extra information within the UUID other than what is defined in this document.

Isn't that exactly what you are doing?

Daegalus 743 days ago [-]

I am aware, just saying per spec, its supposed to be random bit data, thats all I was saying. I am familiar with a spec since I maintain a UUID library that has 6,7, and a custom 8 implemented.

pphysch 743 days ago [-]

> per spec, its supposed to be random bit data

> It can have extra monotonicity data instead

Well, which is it? These are incompatible requirements.

Daegalus 743 days ago [-]

Well, that might be an ambiguity that needs to be brought up before its final if it is an issue.

So if we look at https://www.ietf.org/archive/id/draft-ietf-uuidrev-rfc4122bi...

For list item #3 it says "Random data for each new UUIDv7 generated for any remaining space." without the word "optional" and the bit layout diagram says `rand_b`

Reading section 6.2 https://www.ietf.org/archive/id/draft-ietf-uuidrev-rfc4122bi..., it all involves incrementing counters, or other monotonic random data.

I think I just got too caught up on the the bit layout calling it `random` and misread your information. Sorry for the misunderstanding, and thanks for discussing it.

jeremyjh 743 days ago [-]

Well UUIDv7 can be consumed as a UUIDv4 in the same way, its just 16 bytes. The point of the standard is to define how the particular bytes are chosen.

pphysch 743 days ago [-]

The latest standard for v7 does not meaningfully describe how to interpret the last segment.

To be clear, the standard should be amended to resolve this ambiguity. Say the last bits MAY be monotonic or MAY be pseudorandom. Or add a flag that indicates which.

As there is currently no standard way to interpret these bits, I feel perfectly justified in using the a few of the least significant ones to encode additional information.

jeremyjh 743 days ago [-]

pphysch 743 days ago [-]

> you can use it everywhere and know that keys are assigned the same way regardless of which software stack is generating a particular key.

But even if you follow the standard to a tee, you cannot infer anything about how the last 62 bits were assigned. That is my point!

dralley 743 days ago [-]

Postgresql doesn't care, it's not going to "interpret" those bits, it is just a 128-bit integer.

pphysch 743 days ago [-]

And I'm glad for it, because I could implement this without needing an extension or update to PostgreSQL.

atulvi 743 days ago [-]

Naive Question: The type safe part is just appending a string at the beginning? What if I do that with UUIDv4? is user_49b9cd12-9964-4b9c-8512-742f0a2c9be4 type safe now?

dloreto 743 days ago [-]

That's how the type is encoded as a string, but type-safety ultimately comes from how the TypeID libraries allow you to validate that the type is correct.

davidjfelix 743 days ago [-]

Example:

With prefixed ids that indicate type, a function that assumes type differently from the one supplied will not accidentally succeed.

In addition, humans who copy ids will be less likely to mistake them. This is just an ergonomic/human centric typing.

darajava 743 days ago [-]

0. https://www.crockford.com/base32.html

autoexec 743 days ago [-]

743 days ago [-]

tlrobinson 743 days ago [-]

I really like prefixed identifiers and strongly recommend using them from the beginning.

EDIT: it sounds like that’s the reason for them to be “K-Sortable”:

beyonddream 743 days ago [-]

  {timestamp:64, worker_id:48, seq: 16}

where the seq is incremented if the unique id is requested within the same millisecond.

[1] - https://github.com/beyonddream/snowid

VanillaCafe 743 days ago [-]

timf 743 days ago [-]

[1] - https://www.peakscale.com/strongly-typed-ids/

koito17 743 days ago [-]

BugsJustFindMe 743 days ago [-]

ajanuary 743 days ago [-]

Neat, I like the type safe prefix idea.

Personally, I rarely find I need ids to be sortable, so I just go with pure randomness.

I also like to split out a part of the random section into a tag that is easier for me to visually scan and match up ids in logs etc.

I call my ID format afids [0]

[0] https://github.com/aJanuary/afid

nhumrich 743 days ago [-]

I feel validated. I wrote about this very problem before and came up with something similar https://dev.to/nhumrich/why-i-dont-like-uuids-5d9n

jtmarmon 743 days ago [-]

dloreto 743 days ago [-]

BiteCode_dev 743 days ago [-]

Nice. I should use a similar id with ULID.

Although, if stored in a DB, I would probably split the tag and the id in 2 separate columns, because DBMS often have a dedicated efficient UUID field type.

A little client code wrapper can make that transparent.

wg0 743 days ago [-]

Can anyone guide me about the pros and cons of xid, ksuid and this type-safe option?

matthewfcarlson 743 days ago [-]

ongteckwux 743 days ago [-]

Did a Typescript port. Thanks for creating this! Very useful. https://github.com/ongteckwu/typeid-ts

743 days ago [-]

AtNightWeCode 743 days ago [-]

An important aspect of identifiers is to not leak any information in the identifier. In some scenarios a prefix might be fine but less important things have been blocked by our dpo department.

clintonb 743 days ago [-]

Requirements depend on the use case. I don’t consider the prefix a “leak” and neither does Stripe.

AtNightWeCode 743 days ago [-]

Yeah and Twitter IDs used to be 32-bit integers. In this case it is kind of obvious that the IDs leaks information about internal data types. Which is not ok at many companies.

jeremyjh 743 days ago [-]

changoplatanero 743 days ago [-]

what does k-sortable mean?

tomnipotent 743 days ago [-]

That if a hundred servers are generating (uuid, timestamp) tuples that are subsequently merged on a single machine, and sorted by uuid, it would have almost the same order as if sorted by timestamp.

This property is useful for RDBMS writes, when the UUID is used as a primary key and this locality ensures that fewer slotted pages need to be modified to write the same amount of data.

nzgrover 743 days ago [-]

Is that what they mean by "used as the primary key in a database while ensuring good locality"/"database locality"? That read/write access will hit fewer disk pages?

netcraft 743 days ago [-]

yes, exactly

nickjj 743 days ago [-]

> it would have almost the same order as if sorted by timestamp.

tomnipotent 743 days ago [-]

> There's a big difference between "almost" and "always"

michaelt 743 days ago [-]

The key seems to be based on UUIDv7, starting with a timestamp in milliseconds.

So the order can become out of sync if multiple events happen in the same millisecond; or if your servers' clock error is greater than a millisecond (i.e. if you're an NTP user)

More than sufficient for things like ordering tweets. If you're ordering bank account transactions, well, you'd probably be using transactions in an ACID-compliant relational database.

AtNightWeCode 743 days ago [-]

sixtram 743 days ago [-]

[dead]

iillexial 743 days ago [-]

I didn't get the "type-safe" part. How would it work in Go?

Let's say I have structs:

type User struct { ID TypeID }

type Post struct { ID TypeID }

How can I ensure the correct type is used in each of the structs?

zeroxfe 743 days ago [-]

It's not a language primitive. It's a data format that enables type safety in libraries or APIs (as opposed to a more opaque data format like UUIDv7.)

kibwen 743 days ago [-]

Merad 743 days ago [-]

    interface ITypeIdPrefix
    {
        static abstract string Prefix { get; }
    }

    abstract class TypeId<T>
        where T : TypeId<T>, ITypeIdPrefix, new()
    {
        public Guid Id { get; private init; }

        public override string ToString() => $"{T.Prefix}_{Id.ToBase32String()}";
        // Override GetHashcode(), Equals(), etc.

        public static bool TryParse(string s, out T? result)
        {
            if (!s.StartsWith(T.Prefix) || !TrySplitStringAndParseBase32ToGuid(s, out var id))
            {
                result = default;
                return false;
            }

            result = new T { Id = id };
            return true;
        }
    }

    class UserId : TypeId<UserId>, ITypeIdPrefix
    {
        public static string Prefix => "user";
    }

    class PostId : TypeId<PostId>, ITypeIdPrefix
    {
        public static string Prefix => "post";
    }

leetbulb 743 days ago [-]

One way is to enforce in Marshal[0] and Unmarshal[1]

[0] https://pkg.go.dev/encoding/json#Marshaler

[1] https://pkg.go.dev/encoding/json#Unmarshaler

avgcorrection 743 days ago [-]

It’s stringly-typed type-safety: check if the value has the expected prefix.

hfkwer 743 days ago [-]

This isn't about object types in any particular language.

quelltext 742 days ago [-]

Wouldn't it make sense to allow encoding additional data in the suffix? For instance a sharding key (or whatever you wanna call it)?

jhoechtl 743 days ago [-]

I am no big fan of the semantic web but the prefix should be an URI to be able to segment the domain. A single prefix is IMHO not enough.

onlypositive 743 days ago [-]

> Compare to entirely random global ids, like UUIDv4, that generally suffer from poor database locality.

What does this mean in more words?

armchairhacker 743 days ago [-]

Because the bytes are all random, UUIDv4 is sorted randomly. So whenever you insert a database entry with a new UUID, it ends up getting put in some random memory location.

hot_gril 742 days ago [-]

Right, but worth noting that distributed DBs don't necessarily play nice with sequential pkeys. Spanner explicitly tells you not to use a time-based pkey.

ukuina 743 days ago [-]

Why is it beneficial to sort random IDs?

davidjfelix 743 days ago [-]

So you don't have to create an additional index to scan them in a somewhat sensible order. createdAt just happens to be a naturally decent scan order.

743 days ago [-]

TeeWEE 743 days ago [-]

Good, but I dont see a big advantage over UUIDv7 Anyone has some good ones?

dloreto 743 days ago [-]

It's based on UUIDv7 (in fact, a TypeID can be decoded into an UUIDv7). The main reasons to use TypeID over "raw" UUIDv7 are: 1) For the type safety, and 2) for the more compact string encoding.

If you don't need either of those, then UUIDv7 is the right choice.