Nuit, an indented S-expression format optimized for lots of text (github.com)
3 points by Pauan 4837 days ago | 28 comments


3 points by Pauan 4834 days ago | link

I just finished and pushed out the new and improved regexp implementation of Nuit. It went from this unmaintainable ugly hacky 240 line mess that even I couldn't understand:

https://github.com/Pauan/ar/blob/de25e2b3a6c99496bacbb18203b...

To this lovely clean elegant 157 line beauty which uses an awesome pseudo-continuation-passing-style:

https://github.com/Pauan/ar/blob/08bbc89c65bec5c64a0fc298df3...

Although it doesn't look like it uses regexps much, in the few areas that it does use them, it's a big help. The primary benefit of the rewrite was my cleanup and improvement of the pseudo-continuation-passing-style.

That is now a 100% complete Arc implementation of the Nuit spec. All it needs is a serializer and JSON/YAML meta-format. I also just realized that you could write a meta-format for Nuit that would blow the pants off of Markdown: it'd be so awesome!

I also rewrote basically all the non-Unicode stuff in the spec. Not much really changed, I just wrote down a bunch of stuff that was previously implied but unwritten. I also tried to be much more precise about things like indentation rules.

---

With that out of the way... I can now focus on rewriting my playlist program. I expect it to be much more convenient and much faster once I'm done.

-----

2 points by rocketnia 4834 days ago | link

You now have this JSON example in your spec:

  "foo\u20\20ACbar"

JSON unicode escapes have four hexadecimal characters (per RFC 4627), and you're missing a u.

  "foo\u0020\u20ACbar"

You now say "It is invalid for a non-empty line to be indented if it is not within a list, comment, or string," but the whole file is in a list. :-p

The lack of informative commit messages made it difficult to catch up with the spec changes. Just saying. ^_^

-----

1 point by Pauan 4833 days ago | link

"JSON unicode escapes have four hexadecimal characters (per RFC 4627), and you're missing a u."

Ah, thanks! I'll fix that right away. The reason I left off the leading 0s is that some languages (like Racket) let you do that. Always using 4 is less ambiguous and more portable, so I'll change that.
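
For illustration, here is a minimal Python sketch of the difference, using the standard json module as a stand-in for a strict JSON parser. The helper name is_valid_json is made up for this example and is not part of Nuit or its spec.

```python
import json

def is_valid_json(text):
    """Return True if Python's standard json module accepts `text`."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Four hex digits per escape: U+0020 (space) and U+20AC (euro sign).
good = '"foo\\u0020\\u20ACbar"'
# The example as it appeared in the spec: a two-digit escape,
# followed by an escape that is missing its "u".
bad = '"foo\\u20\\20ACbar"'

print(json.loads(good) == "foo \u20acbar")  # True
print(is_valid_json(bad))                   # False
```

Other parsers may word their errors differently, but a parser that follows the RFC should reject the second string.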

---

"You now say "It is invalid for a non-empty line to be indented if it is not within a list, comment, or string," but the whole file is in a list. :-p"

You and your pedantry! The top-level implied list doesn't count. I would hope that's obvious enough that I don't need to explicitly say it, but... unfortunately I know how some people are.

Hm... come to think of it... I could just give the implied list the same rules as explicit lists. Which means that it's okay for top-level expressions to be indented, just so long as they're all the same indent. That solves the issue while being more flexible and consistent. I'll do that instead, and thanks to the rewrite, it's a trivial change to make.

Looks like your pedantry saved the day (again)!

---

"The lack of informative commit messages made it difficult to catch up with the spec changes. Just saying. ^_^"

Well, I usually end up putting a lot of changes into a single commit, so just reading the diffs should be enough? In any case, as said, the spec itself didn't really change much, things just got clarified.

Nuit hasn't changed very much from when I first posted it, except that # and " are now multi-line block prefixes, whereas # used to be single-line and " used to be a delimiter like in JSON/YAML.

Oh yeah, it also ignores whitespace now, thanks to your suggestion. It used to throw an error. Oh! And the second part of the @ list can now be any arbitrary sigil rather than just a string. I think that's about it...

I honestly don't expect Nuit to change much from this point onward. I think things are in a pretty stable state. But I'm still not entirely sure about handling whitespace at the start/end of a string, and I've been mulling over the idea of getting rid of the \ sigil...

-----

1 point by rocketnia 4833 days ago | link

"The reason I left off the leading 0s is that some languages (like Racket) let you do that."

Hmm, I thought JavaScript was like that too, but it appears ECMAScript 5 doesn't allow it, and Chrome's implementation doesn't like it either.

---

"Well, I usually end up putting a lot of changes into a single commit, so just reading the diffs should be enough?"

There were lots and lots of indentation-only changes. If those were separated into their own commits, with commit messages that indicated that the indentation was all that changed, it would have been easier.

I trust you to know that ultimately, it doesn't matter what's easy for me as long as it's easy for you. :-p

---

"Nuit hasn't changed very much from when I first posted it[...]"

The wordings changed. Even if you had commit messages that stated your intentions like this, I would have looked at the changes carefully in case something became contradictory or ambiguous by accident.

In hindsight, I should have just checked out the old and new versions of the project and done a diff, lol.

Don't trust me to go to this effort all the time, but I guess I was in the mood for it.

---

"Oh! And the second part of the @ list can now be any arbitrary sigil rather than just a string."

That was the most significant change, in my mind. This example of yours should be a good test case for Nuit implementations:

  Nuit  @foo @bar qux
               yes nou
          corge
          @maybe
            @
            someday
  JSON  ["foo", ["bar", "qux", "yes nou"], "corge", ["maybe", [], "someday"]]

---

"I've been mulling over the idea of getting rid of the \ sigil..."

I've been wondering about that too. It seems like " or ` will work just as well for those cases.

-----

1 point by Pauan 4833 days ago | link

"There were lots and lots of indentation-only changes. If those were separated into their own commits, with commit messages that indicated that the indentation was all that changed, it would have been easier."

Sure, but that woulda been more work for me. :P I honestly wasn't expecting you to pore through the commit log... Since I am used to working alone, I just use git as essentially a safety net: it lets me go back to an old version just in case the new version doesn't work out. So commit messages aren't nearly as important to me as they would be in a team-based environment.

---

"The wordings changed. Even if you had commit messages that stated your intentions like this, I would have looked at the changes carefully in case something became contradictory or ambiguous by accident."

Then the commit messages would have been useless anyways, right? :P

---

"In hindsight, I should have just checked out the old and new versions of the project and done a diff, lol."

Github even lets you do a diff on their website! :D

---

"Don't trust me to go to this effort all the time, but I guess I was in the mood for it."

I honestly wasn't expecting anything like that.

---

"That was the most significant change, in my mind. This example of yours should be a good test case for Nuit implementations:"

Check out "nuit-test.arc" which should have conformance tests for everything in the Nuit spec:

https://github.com/Pauan/ar/blob/arc/nu/lib/nuit/nuit-test.a...

---

"I've been wondering about that too. It seems like " or ` will work just as well for those cases."

Yeah, I know. I guess it's because the only single-line thing is a non-sigil, and I wanted to be able to slap anything in without worrying about it, so \ was a single-line escape thingy. But I think I can safely get rid of it.

-----

2 points by Pauan 4833 days ago | link

Aaaand, done! https://github.com/Pauan/ar/commit/f49cd71a5cdead32ab49deff2...

-----

2 points by Pauan 4832 days ago | link

I've made the following changes to the Nuit spec:

---

The # sigil now ignores everything that's indented further than the sigil:

  Nuit  #foo bar
          qux corge
         @nou yes
            maybe someday
        @not included
  JSON  ["not", "included"]
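
A rough Python sketch of that rule, not the actual Arc implementation. The name strip_comments is hypothetical, treating blank lines as part of the comment is an assumption, and the sketch ignores ` and " blocks, where a # presumably should be left alone.

```python
def strip_comments(text):
    """Drop each '#' line plus every following line indented further than it."""
    kept = []
    comment_indent = None  # indent of the '#' we are currently skipping under
    for line in text.split("\n"):
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if comment_indent is not None:
            # Assumption: blank lines do not end a comment block.
            if not stripped or indent > comment_indent:
                continue
            comment_indent = None
        if stripped.startswith("#"):
            comment_indent = indent
            continue
        kept.append(line)
    return "\n".join(kept)

source = "#foo bar\n  qux corge\n @nou yes\n    maybe someday\n@not included"
print(strip_comments(source))  # @not included
```
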
---

The ` and " sigils now have significantly different indentation rules. In particular, `foobar is now invalid: there must always be a space after the sigil. This change means that you can now include spaces at the front of a string without using a Unicode escape:

  Nuit  `  foobar
  JSON  " foobar"

---

The " sigil supports two new escapes: \s inserts a space and \n inserts a newline. These are just conveniences for \u(20) and \u(A) respectively. This makes it a lot easier to include spaces at the end of a single-line string.

---

I got rid of the \ sigil because it's not necessary: you can just prefix the line with ` or " to get the same effect.

---

I specified that strings within a Unicode escape must be separated by only one space. That means this is valid:

  \u(20 A D20)

But these are all invalid:

  \u( 20 A D20)
  \u(20  A D20)
  \u(20 A D20 )
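
The single-space rule is easy to check with a regexp. Below is a Python sketch under a few assumptions: the function name expand_unicode_escapes is invented, accepting lowercase hex digits is a guess, and only the \u(...) form is handled (not \s or \n).

```python
import re

# Matches \u( ... ) and captures whatever sits between the parentheses.
_ESCAPE = re.compile(r"\\u\(([^)]*)\)")
# One or more hex numbers separated by exactly one space, no padding.
_BODY = re.compile(r"[0-9A-Fa-f]+(?: [0-9A-Fa-f]+)*")

def expand_unicode_escapes(s):
    """Replace each \\u(...) escape with the code points it names."""
    def repl(match):
        body = match.group(1)
        if not _BODY.fullmatch(body):
            raise ValueError("invalid Unicode escape: " + match.group(0))
        return "".join(chr(int(h, 16)) for h in body.split(" "))
    return _ESCAPE.sub(repl, s)

print(expand_unicode_escapes("Visit\\u(20)") == "Visit ")  # True
```

With this sketch, \u(20 A D20) expands to a space, a newline, and U+0D20, while the three invalid forms above raise ValueError.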

-----

1 point by rocketnia 4832 days ago | link

"I specified that strings within a Unicode escape must be separated by only one space."

Drat. I thought of it as a nifty way to split a large chunk of escape sequences over multiple lines. It's not like I was ever going to want to do that, though.

-----

2 points by Pauan 4832 days ago | link

That was actually never possible, because of the way the parser chunks the stream into lines and then operates on each individual line. I had considered making it possible... but I think that would have been too confusing. And it would have been far too little gain for too much work...

-----

1 point by rocketnia 4831 days ago | link

I like this line of reasoning. ^_^

-----

1 point by Pauan 4837 days ago | link

Indentation rocks!

The README should give you all the information you need to know. If not, feel free to post here.

That link also contains an almost-complete implementation in Arc of the Nuit spec.

Unfortunately, even though it's only 263 lines long, I find the implementation much too complicated for my liking, so I'm planning to rewrite it using regexps. That should be okay because of the format's simplicity and its reliance on indentation and sigils. So it's not like it'll summon Cthulhu or anything[1]...

---

You might ask, "why not just use S-expressions?" Well, that's because although S-expressions are fine for representing code, they're less ideal for representing lots of text, because you always have to enclose strings with " and you have to remember to use \ to escape things. Nuit removes the need for that in most cases.

You might ask, "why not just use Nulan's parser?" Well, the same reasons as for not using S-expressions... but in addition, Nulan's syntax is more complicated than plain-ol' S-expressions, and I wanted a format that was dirt-simple and language-agnostic, so that I could easily write a parser for it in any language (Arc, Python, JavaScript, Nulan, etc.)

So, I still think code should be represented with either raw S-expressions or Nulan syntax, but things that are language-agnostic or contain lots of text should use Nuit (or YAML, if Nuit is too simple for your needs). In particular, even though I haven't tried it yet, I believe that Nuit would be a glorious way to write HTML.

I currently plan to use Nuit for my playlist program, where it has worked out very nicely due to its lack of quoting and escaping.

---

* [1]: http://stackoverflow.com/questions/1732348/regex-match-open-...

-----

1 point by rocketnia 4837 days ago | link

To be precise, apparently the result of parsing a Nuit value is either of the following:

- A Nuit string, which is a finite sequence of 16-bit values (just like a JavaScript or JSON string).

- A pair consisting of a Nuit string and a finite sequence of Nuit values. This recursion can't create cycles.

When translating Nuit to JSON, Nuit strings become JSON strings, and string-sequence pairs become JSON Arrays where the first element is a string.

Is this right?

---

Raw text appearing in Nuit's surface syntax (which starts as UTF-8, as you specify) becomes encoded as a sequence of UTF-16 code units, right? You just use the word "characters" as if it's obvious, but if you want strings to be sequences of full Unicode code points, your 16-bit escape sequences aren't sufficient.

Does the byte-order mark have any effect on the indentation of the first line?

What if the first line is indented but it occurs after a shebang line? Does the # consume it?

If I understand correctly, every Nuit comment must take up at least one whole line. There's no end-of-line comment. Is this intentional?

---

When I use JSON, I often encode richer kinds of data in the form {"type":"mydatatype",...} or rarely ["mydatatype",...]. Here's a stab at encoding richer data (in this case JSON!) inside Nuit:

  [{a:1,b:null},"null"]
  -->
  @array
    @obj
      a
      @number 1
      b
      @null
    @string null

I don't have an opinion about this yet, but it's something to contemplate.

-----

1 point by Pauan 4837 days ago | link

"A Nuit string, which is a finite sequence of 16-bit values (just like a JavaScript or JSON string)."

A finite sequence of Unicode characters. UTF-8 is recommended, but the encoding can be any Unicode encoding (UTF-32/16/8/7, Punycode, etc.)

---

"A pair consisting of a Nuit string and a finite sequence of Nuit values. This recursion can't create cycles."

No, because it uses the abstract concept of "list", which might map to a vector, array, cons, binary tree, etc. The only requirement is that it can hold 0 or more strings in order. How it's represented in a particular programming language is an implementation detail, not part of the specification.

---

"When translating Nuit to JSON, Nuit strings become JSON strings, and string-sequence pairs become JSON Arrays where the first element is a string."

Yes, except an empty Nuit list would be an empty JSON array. Also, if a meta-encoding scheme were used, it is possible for the serializer to encode Nuit as a JSON object, number, etc. But that's just de facto conventions, not part of the spec.
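
As a sketch of that data model in Python: a parsed Nuit value is either a string or a list of further values, and the plain translation to JSON needs no meta-encoding at all. The names NuitValue, is_nuit_value and nuit_to_json are invented here, and Python lists are just one possible mapping of the abstract "list".

```python
import json
from typing import List, Union

# A Nuit value is a string or a (possibly empty) list of Nuit values.
NuitValue = Union[str, List["NuitValue"]]

def is_nuit_value(value):
    """Check that `value` contains nothing but strings and lists."""
    if isinstance(value, str):
        return True
    if isinstance(value, list):
        return all(is_nuit_value(item) for item in value)
    return False

def nuit_to_json(value):
    """Strings become JSON strings, lists become JSON arrays."""
    if not is_nuit_value(value):
        raise TypeError("not a Nuit value")
    return json.dumps(value)

print(nuit_to_json(["foo", ["bar", "qux"], []]))  # ["foo", ["bar", "qux"], []]
```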

---

"Raw text appearing in Nuit's surface syntax (which starts as UTF-8, as you specify)"

Actually, the spec doesn't mention any encoding at all. It deals only with Unicode characters, with the encoding being an implementation detail. Parsers/serializers can use any encoding they want, as long as it supports Unicode. Even Punycode could be used.

In the "Size comparison" section I mention that it is assumed that UTF-8 is used in serialization. That was just so that the bytes would be consistent between the different examples.

It's also useful because it mimics a common situation found when transmitting data over HTTP, so it's closer to a "real world" benchmark rather than a synthetic one. That's also why CR+LF line endings were used rather than just LF.

As noted at the very bottom, if LF or CR endings are used, then Nuit becomes even shorter. This means that even in the worst-case scenario of CR+LF, Nuit is still shorter than JSON.

---

"You just use the word "characters" as if it's obvious, but if you want strings to be sequences of full Unicode code points, your 16-bit escape sequences aren't sufficient."

Incorrect. UTF-7/8 and UTF-16 are capable of representing all Unicode code points. UTF-7/8 does so by using a variable number of bytes. UTF-16 does so by using surrogate pairs. Punycode does so by using dark voodoo magic.

All that matters is that a string is a finite sequence of Unicode code points. How those code points are encoded is an implementation detail.
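
To make that concrete, here is a small Python sketch that encodes one code point above U+FFFF in three encodings and computes the UTF-16 surrogate pair by hand. The code point U+12345 and the helper name surrogate_pair are only examples.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

ch = chr(0x12345)  # one code point outside the 16-bit range

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(utf8.hex())   # f0928d85
print(utf16.hex())  # d808df45
print(utf32.hex())  # 00012345
print([hex(unit) for unit in surrogate_pair(0x12345)])  # ['0xd808', '0xdf45']
```

All three byte sequences decode back to the same single code point, which is the sense in which the encoding is an implementation detail.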

Hmmm... I think the current spec actually forbids certain valid UTF-16 strings, because surrogate pairs are forbidden. So I should change the Unicode part of the spec so it works correctly in all Unicode encodings.

---

"Does the byte-order mark have any effect on the indentation of the first line?"

Nope. It's a part of the encoding and thus is an implementation detail, so it has no effect on indentation.

---

"What if the first line is indented but it occurs after a shebang line? Does the # consume it?"

Yes, the # would consume it. If you don't want that, then the first line must not be indented. The same is true of @ and ` and ". This is intentional. In fact, it's actually illegal for the first sigil to be indented. This is to help avoid the kind of mistakes that you're talking about.

---

"If I understand correctly, every Nuit comment must take up at least one whole line. There's no end-of-line comment. Is this intentional?"

That is correct and it is intentional. The design of Nuit only allows sigils at the start of a line. This makes it easy to take almost any arbitrary string and plop it in without having to quote or escape it. Which means that this Nuit code:

  @foo bar
    qux#nou

Would be equivalent to this JSON:

  ["foo", "bar", "qux#nou"]

That's part of the secret to not needing delimiters and escapes. The other part of the secret is using indentation, like with `

---

"I don't have an opinion about this yet, but it's something to contemplate."

I have already thought about such "meta-encoding schemes." Nuit itself doesn't do anything special with them, but applications can use the information to do something special. It is up to the applications to parse things in the way they want to.

I'm not against a Nuit parser/serializer using those kinds of de facto encoding schemes, but I want to keep Nuit simple, so I don't plan to put them into the spec. But, I might include some standard meta-encodings for JSON and YAML. They would be built on top of the simpler Nuit which supports only lists and strings.

By the way, I would encode the JSON like this:

  @list
    @dict
      @a 1
      @b null
    @str null
  -->
  [{"a":1,"b":null},"null"]

This works because JSON keys must always be strings.
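
A hypothetical decoder for such a meta-encoding might look like the Python sketch below. It works on already-parsed Nuit data (nested lists of strings). The tags list, dict and str come from the example above; treating every untagged string as a JSON literal (1, null, true, ...) is an assumption of this sketch, as is the name decode_meta.

```python
import json

def decode_meta(node):
    """Turn tagged Nuit lists into Python values that json.dumps understands."""
    if isinstance(node, str):
        # Assumption: untagged strings are JSON literals such as 1 or null.
        return json.loads(node)
    head, *rest = node
    if head == "list":
        return [decode_meta(item) for item in rest]
    if head == "dict":
        # Each child is [key, value]; keys are always plain strings.
        return {key: decode_meta(value) for key, value in rest}
    if head == "str":
        return rest[0] if rest else ""
    raise ValueError("unknown tag: " + repr(head))

parsed = ["list", ["dict", ["a", "1"], ["b", "null"]], ["str", "null"]]
print(json.dumps(decode_meta(parsed), separators=(",", ":")))
# [{"a":1,"b":null},"null"]
```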

---

"Apparently I have to use \u(20) in order to put a space at the end of a string."

Yes, except that \ is only valid at the start of a line or within a " so you would have to prefix those lines with ":

  @tag a
    @attr href http://www.arclanguage.org/
    "Visit\u(20)
    @tag cite
      Arc Forum
    !

I'm still thinking about the right interaction of whitespace, ", and \ escaping. But I believe making whitespace at the end of the line illegal is an overall net gain. I might change my mind about it later.

-----

1 point by rocketnia 4836 days ago | link

"A finite sequence of Unicode characters."

This is a bit of a spec wormhole (as akkartik calls it ^_^ ), but go with it if you feel it's right.

If I want to escape a Unicode character in the 10000-10FFFF range, can I use \u(12345) or whatnot?

Are Nuit strings, and/or the text format Nuit data is encoded in, allowed to have unnormalized or invalid Unicode? If invalid Unicode is disallowed, then you'll have trouble encoding JSON in Nuit, since JSON strings are just sequences of 16-bit values, which might not validate as Unicode.

Are you going to do anything special to accommodate the case where someone concatenates two UTF-16 files and ends up with a byte order mark in the middle of the file? (I was just reading the WHATWG HTML spec today, and HTML treats the byte order mark as whitespace, using this as justification. Of course, the real justification is probably that browsers have already implemented it that way.)

---

"The only requirement is that it can hold 0 or more strings in order."

Technically it needs to hold sub-lists too, but I know that's not your point.

Zero? How do you encode a zero-length list in Nuit?

Is there a way to encode a list whose first element is a list?

Oh, come to think of it, is there a way to encode a list whose first element is a string with whitespace inside?

---

"In fact, it's actually illegal for the first sigil to be indented."

Cool. Put it in the doc. ^_^

I assume you mean the first line, rather than the first sigil. The first line could be a sigil-free string, right?

Speaking of which, it seems like there will always be exactly one unindented line in Nuit's textual encoding, that line being at the beginning. Is this true?

---

"I'm not against a Nuit parser/serializer using those kinds of de facto encoding schemes, but I want to keep Nuit simple, so I don't plan to put them into the spec."

I like it that way too.

---

  @attr href http://www.arclanguage.org/

Er, I think that creates the following:

  [ "attr", "href http://www.arclanguage.org/" ]

---

"But I believe making whitespace at the end of the line illegal is an overall net gain."

I agree. I'm not sure I'd make it illegal, but I'd at least ignore it.

If you make whitespace at the end of blank lines illegal, bah! I like to indent my blank lines. :-p

This is partially because I've used editors which do it for me, but also because I code with whitespace visible, and a completely blank line looks like a hard boundary between completely separate blocks of code.

-----

1 point by Pauan 4836 days ago | link

"This is a bit of a spec wormhole (as akkartik calls it ^_^ )"

I have no clue what you're talking about.

---

"If I want to escape a Unicode character in the 10000-10FFFF range, can I use \u(12345) or whatnot?"

I don't see why not...

---

"Are Nuit strings, and/or the text format Nuit data is encoded in, allowed to have unnormalized or invalid Unicode?"

Invalid Unicode is not allowed.

---

"If invalid Unicode is disallowed, then you'll have trouble encoding JSON in Nuit, since JSON strings are just sequences of 16-bit values, which might not validate as Unicode."

I have no clue where you got that idea from... I'm assuming you mean that JSON is encoded in UTF-16.

UTF-16 is just a particular encoding of Unicode that happens to use two or four 8-bit bytes, that's all. UTF-16 can currently handle all valid Unicode and doesn't allow for invalid Unicode.

But JSON doesn't even use UTF-16. Just like Nuit, JSON uses "sequences of Unicode characters" for its strings. And also like Nuit, JSON doesn't specify the encoding: neither "json.org" nor Wikipedia makes any mention of encoding. And the JSON RFC (https://tools.ietf.org/html/rfc4627) says:

  JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

YAML should be fine too (http://www.yaml.org/spec/1.2/spec.html#id2771184):

  All characters mentioned in this specification are Unicode code points. Each
  such code point is written as one or more bytes depending on the character
  encoding used. Note that in UTF-16, characters above #xFFFF are written as
  four bytes, using a surrogate pair.

  The character encoding is a presentation detail and must not be used to
  convey content information.

  On input, a YAML processor must support the UTF-8 and UTF-16 character
  encodings. For JSON compatibility, the UTF-32 encodings must also be
  supported.

ECMAScript 5 (http://www.ecma-international.org/publications/files/ECMA-ST...) does specify UTF-16:

  A conforming implementation of this Standard shall interpret characters in
  conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC
  10646-1 with either UCS-2 or UTF-16 as the adopted encoding form,
  implementation level 3.

And I believe Java also uses UTF-16.

But I see no reason to limit Nuit to only certain encodings. And if I did decide to specify a One True Encoding To Rule Them All, I'd specify UTF-8 because it's the overall best Unicode encoding that we have right now.

Instead, if a Nuit parser/serializer is used on a string that it can't decode/encode, it just throws an error. It's very highly recommended to support at least UTF-8, but any Unicode encoding will do.

---

"Are you going to do anything special to accommodate the case where someone concatenates two UTF-16 files and ends up with a byte order mark in the middle of the file? (I was just reading the WHATWG HTML spec today, and HTML treats the byte order mark as whitespace, using this as justification. Of course, the real justification is probably that browsers have already implemented it that way.)"

The current spec for Nuit says to throw an error for byte order marks appearing in the middle of the file.
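
In Python terms, that rule could be sketched roughly as follows. The sketch uses UTF-8 rather than UTF-16 for brevity, relies on the standard utf-8-sig codec to drop a single leading byte order mark, and the function name decode_nuit_bytes is made up.

```python
def decode_nuit_bytes(data):
    """Decode UTF-8 input, allowing a BOM only at the very start."""
    text = data.decode("utf-8-sig")  # strips one leading BOM if present
    if "\ufeff" in text:
        raise ValueError("byte order mark in the middle of the input")
    return text

print(decode_nuit_bytes(b"\xef\xbb\xbf@foo bar"))  # @foo bar
```

Concatenating two files that each start with a BOM would trip the check, because the second BOM ends up in the middle of the decoded text.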

---

"Zero? How do you encode a zero-length list in Nuit?"

Easy, just use a plain @ with nothing after it:

  @
  @foo

The above is equivalent to the JSON [] and ["foo"]

---

"Is there a way to encode a list whose first element is a list?"

Yes:

  @
    @bar qux

And I was thinking about changing the spec so that everything after the first string is treated as a sigil rather than as a string. Then you could say this:

  @foo @bar qux -> ["foo", ["bar", "qux"]]
  @ @bar qux    -> [["bar", "qux"]]

---

"Oh, come to think of it, is there a way to encode a list whose first element is a string with whitespace inside?"

Yes, there are two ways:

  @ foo bar qux
  @
    foo bar qux

The above is equivalent to the JSON list ["foo bar qux"]. This follows naturally if you assume that empty strings aren't included in the list.
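
The @ rules discussed so far fit in a short recursive parser. The Python sketch below is only an approximation built from the examples in this thread, not the real Arc implementation: it handles @ lists and plain strings only (no #, `, " or escapes), it treats the rest of an @ line as one string rather than as a sigil, it wraps everything in the implied top-level list, and the name parse_nuit_subset is made up.

```python
def parse_nuit_subset(text):
    """Parse '@' lists and plain strings into nested Python lists."""
    # Collect (indent, content) for every non-blank line.
    lines = []
    for raw in text.split("\n"):
        content = raw.strip()
        if content:
            lines.append((len(raw) - len(raw.lstrip(" ")), content))

    pos = 0

    def parse_block(indent):
        nonlocal pos
        items = []
        while pos < len(lines) and lines[pos][0] == indent:
            content = lines[pos][1]
            pos += 1
            if content.startswith("@"):
                # First token is the first element; the rest is one string.
                head, _, rest = content[1:].partition(" ")
                node = [s for s in (head, rest.strip()) if s]  # drop empties
                if pos < len(lines) and lines[pos][0] > indent:
                    node.extend(parse_block(lines[pos][0]))
                items.append(node)
            else:
                items.append(content)
        return items

    result = parse_block(0)  # the implied top-level list
    if pos != len(lines):
        raise ValueError("unexpected indentation")
    return result

print(parse_nuit_subset("@\n@foo"))         # [[], ['foo']]
print(parse_nuit_subset("@ foo bar qux"))   # [['foo bar qux']]
print(parse_nuit_subset("@\n  @bar qux"))   # [[['bar', 'qux']]]
```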

---

"I assume you mean the first line, rather than the first sigil. The first line could be a sigil-free string, right?"

Right, so I suppose it would be "the first non-empty line must have no spaces at the front of it".

---

"Er, I think that creates the following:"

This is true. You could use this, though:

  @attr href
    http://www.arclanguage.org/

Not as nifty, but it works.

---

"If you make whitespace at the end of blank lines illegal, bah! I like to indent my blank lines. :-p"

Nuuu. Well, maybe. I'm undecided on that whole issue.

-----

2 points by rocketnia 4836 days ago | link

"the JSON RFC (https://tools.ietf.org/html/rfc4627) "

Ah, I get to learn a few new things about JSON! JSON strings are limited to valid Unicode characters, and "A JSON text is a serialized object or array," not a number, a boolean, or null. All this time I thought these were just common misconceptions! XD

It turns out my own misconceptions about JSON are based on ECMAScript 5.

To start, ECMAScript 5 is very specific about the fact that ECMAScript strings are arbitrary sequences of unsigned 16-bit values.

  4.3.16
  String value
  primitive value that is a finite ordered sequence of zero or more
  16-bit unsigned integer
  
  NOTE A String value is a member of the String type. Each integer
  value in the sequence usually represents a single 16-bit unit of
  UTF-16 text. However, ECMAScript does not place any restrictions or
  requirements on the values except that they must be 16-bit unsigned
  integers.

ECMAScript 5's specification of JSON.parse and JSON.stringify explicitly calls out the JSON spec, but then it relaxes the constraint that the top level of the value must be an object or array, and it subtly (maybe too subtly) relaxes the constraint that the strings must contain valid Unicode: It says "The JSON interchange format used in this specification is exactly that described by RFC 4627 with two exceptions," and one of those exceptions is that conforming implementations of ECMAScript 5 aren't allowed to implement their own extensions to the JSON format, and must instead use exactly the format defined by ECMAScript 5. As it happens, the formal JSON grammar defined by ECMAScript 5 supports invalid Unicode.
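
The same situation can be reproduced outside ECMAScript. As an analogy only, this Python sketch shows that Python's standard json module also accepts an escape for an unpaired surrogate, producing a string that is not valid Unicode and cannot be encoded as UTF-8. It says nothing about how any particular JavaScript engine behaves.

```python
import json

# A JSON string containing an escaped, unpaired high surrogate.
lone = json.loads('"\\ud800"')
print(len(lone))  # 1

try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False  # lone surrogates are not valid in UTF-8

print(encodable)  # False
```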

---

"This follows naturally if you assume that empty strings aren't included in the list."

I'm not most people, but when the Nuit spec says "Anything between the @ and the first whitespace character is the first element of the list," I don't see a reason to make "@ " a special case that means something different.

-----

1 point by Pauan 4836 days ago | link

"I'm not most people, but when the Nuit spec says "Anything between the @ and the first whitespace character is the first element of the list," I don't see a reason to make "@ " a special case that means something different."

Then I'll change the spec to be more understandable. What wording would you prefer?

-----

2 points by rocketnia 4835 days ago | link

I'll try to make it a minimal change: "If there's anything between the @ and the first whitespace character, that intervening string is the first element of the list."

-----

1 point by Pauan 4835 days ago | link

That sounds great! I'll put in something like that, and also explain about the implicit list.

-----

1 point by Pauan 4836 days ago | link

"This follows naturally if you assume that empty strings aren't included in the list."

Before rocketnia jumps in and says something... if you do want to include an empty string, you must use ` or " which always return a string.

-----

1 point by akkartik 4836 days ago | link

> > This is a bit of a spec wormhole (as akkartik calls it ^_^ )

> I have no clue what you're talking about.

Me neither :)

Oh, I think it's http://article.gmane.org/gmane.lisp.readable-lisp/302

-----

1 point by Pauan 4836 days ago | link

Well, it's true that the Nuit spec intentionally ignores encoding issues, and thus a Nuit parser/serializer might need to understand encoding in addition to the Nuit spec. I don't see a problem with that.

The Arc implementation of Nuit basically just ignores encoding issues because Racket already takes care of all that. So any encoding information in the Nuit spec would have just been a nuisance.

There's already plenty of information out there about different Unicode encodings, so people can just use that if they don't have the luxury of relying on a system like Racket.

---

I see encoding as having to do with the storage and transportation of text, which is certainly important, but it's beyond the scope of Nuit.

Perhaps a Nuit serializer wants to use ASCII because the system it's communicating with doesn't support Unicode. It could then use Punycode encoding.

Or perhaps the Nuit document contains lots of Asian symbols (Japanese, Chinese, etc.) and so the serializer wants to use an encoding that is better (faster or smaller) for those languages.

Or perhaps it's transmitting over HTTP in which case it must use CR+LF line endings and will probably want to use UTF-8.

---

I'll note that Nuit also doesn't specify much about line endings. It says that the parser must convert line endings to U+000A but it doesn't say what to do when serializing.

If serializing to a file on Windows, the serializer probably wants to use CR+LF. If on Linux it would want to use LF. If transmitting over HTTP it must use CR+LF, etc.
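
A small Python sketch of that split between parsing and serializing. Which sequences count as line endings is an assumption here (CR+LF, CR and LF only), and the names normalize_newlines and serialize_lines are invented for the example.

```python
import re

def normalize_newlines(text):
    """Parser side: turn CR+LF and lone CR into U+000A."""
    return re.sub(r"\r\n|\r", "\n", text)

def serialize_lines(lines, newline="\n"):
    """Serializer side: the caller picks the line ending for the target."""
    return newline.join(lines)

print(normalize_newlines("a\r\nb\rc\nd").split("\n"))  # ['a', 'b', 'c', 'd']
print(repr(serialize_lines(["@foo", "  bar"], newline="\r\n")))  # '@foo\r\n  bar'
```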

Nuit also doesn't specify endianness, or whether a list should map to an array or a vector, or how big a byte should be, or whether the computer system is digital/analog/quantum, or or or...

Nuit shouldn't even be worrying about such things. Nuit shouldn't have to specify every tiny minuscule detail of how to accomplish things.

They are implementation details, which should be handled by the parsers/serializers on a case-by-case basis, in the way that seems best to them.

-----

2 points by rocketnia 4836 days ago | link

"Oh, I think it's http://article.gmane.org/gmane.lisp.readable-lisp/302 "

Yep! Sorry, I forgot that wasn't on Arc Forum.

See, I follow your links. ^_^

-----

1 point by Pauan 4836 days ago | link

Oh, I missed a question.

"Speaking of which, it seems like there will always be exactly one unindented line in Nuit's textual encoding, that line being at the beginning. Is this true?"

Well, not exactly. If all the lines are blank, that's fine too. But assuming at least one non-blank line... there must be at least one line that is unindented. There might be more than one unindented line.

-----

1 point by rocketnia 4836 days ago | link

What would more than one unindented line do?

If they're put in a list...

  foo
  bar
  baz
  -->
  ["foo","bar","baz"]

...then I'd say a single item should be put in a list too:

  foo
  -->
  ["foo"]
  
  @foo
  -->
  [["foo"]]

And yet now Nuit values at the root can't be strings; they must be lists.

-----

1 point by Pauan 4836 days ago | link

"...then I'd say a single item should be put in a list too [...] And yet now Nuit values at the root can't be strings; they must be lists."

That's correct. There's an implied list wrapping the entire Nuit text. You can think of it like XML's root node, except in Nuit it's implicit rather than explicit.

Calling "readfile" in Arc also returns a list of S-expressions, so this isn't without precedent.

-----

1 point by rocketnia 4835 days ago | link

Okay, fair enough. I see that was implied by the documentation's examples all along too. ^_^

-----

1 point by rocketnia 4837 days ago | link

Here are a couple of stabs at encoding HTML:

  @tag a
    @attr href http://www.arclanguage.org/
    Visit\u(20)
    @tag cite
      Arc Forum
    !

  @a
    href
    http://www.arclanguage.org/
    >
    Visit\u(20)
    @cite >
      Arc Forum
    !

I'm using > as a delimiter symbol there, since it's an invalid attribute name and it's visually similar to HTML.

Apparently I have to use \u(20) in order to put a space at the end of a string.

(Not that I've actually run this through the parser or anything.)

-----