Report of Meeting 2024-06-27
Present: Ed Gottsman, Raul Miller and Bob Therriault
Full transcripts of this meeting are now available below on this wiki page.
1) Bob has finished his work on the Orphans page and has reduced the number of Orphans to 37 pages. Initially, Orphans were the pages with no links to them, based on Eric's original crawl of the site. When Bob went through the categories he found that some of these pages could be linked to categories, but they remained in the Orphans category. This last clean-up has rectified the issue, and the Orphans category page is now much more reasonable.
2) We had a quick discussion about Jan Jacobs' category creator. Ed felt that the process continues to move forward, but wonders why Jan seems concerned with how his curation corresponds to the existing tree. The fact that they don't correspond is actually good, because it could indicate a different way of looking at the problem. If it is going to require a lot of human intervention, it might end up being the same as having someone else do the curation to get a different point of view on the selection of categories. Bob will set up a meeting with Jan and Ed.
3) Raul talked about the Modifier Trains and mentioned that there is a While definition, (^:^:_), that he felt was useful enough to justify bringing the Modifier Trains back.
4) Raul also talked about the challenges he was facing with the old x. y. u. v. m. n. operands. Ed felt that this could be corrected by adding them as J words in the search index. This may not extend to stemming into longer phrases, as the stemming doesn't have an API and Ed is unaware of the algorithm used. Ed wondered whether wildcard searches might be available, although the J Viewer is not currently set up to handle regex queries. Ed suggested that he could create a database containing the full text of the Wiki corpus, which Raul could then load and search with regex.
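For illustration, SQLite's FTS5 module does support prefix queries natively (a trailing * on a token), which may cover part of the wildcard case without needing regex. This is a minimal sketch; the table name and rows are invented stand-ins, not the J Viewer's actual schema:

```python
import sqlite3

# Invented table and rows, purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("Adverbs", "the old jv54 form is archaic"),
    ("Verbs", "plain jv searching"),
])

# FTS5 allows a trailing * on a token for prefix matching,
# so this finds "jv54" without scanning with LIKE
rows = conn.execute("SELECT title FROM docs WHERE docs MATCH 'jv5*'").fetchall()
```

Note that FTS5 prefix matching only works at token boundaries, so whether it helps depends on how the glyphs are tokenized at indexing time.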
5) Bob mentioned that the original Marcin Żołek email seemed not to be on the Jsoftware forums, and Raul suggested that it was on the dev forum, to which Bob was not subscribed. Ed said that he had not yet incorporated the Google Groups into the J Viewer, as it seems to be a very challenging problem. Bob wondered if it would be possible to just subscribe to the group and have the information transferred as it is created. Ed said that he would commit to feeling guilty about not doing it yet. Bob said that was not his intention, but it would be good to have access. It seems that an archive could be created, but Google does not make it easy to decode the archive.
For access to previous meeting reports, see https://code.jsoftware.com/wiki/Wiki_Development. If you would like to participate in the development of the J wiki, please contact us on the J forum and we will get you an invitation to the next J wiki meeting, held on Thursdays at 23:00 (UTC).
Transcript
I guess the first thing I was working on was the orphans.
And I'll just share the screen.
And I've got it down to this many pages now.
And these are truly orphans.
What I realized is there were two things in the process of categorizing that I was working against in calling things orphans.
Originally I was calling things orphans if when Eric did his survey of the site, he found that there were no parent pages, which makes them orphans.
But then what I did is I went in and looked at those pages and said, hey, this kind of fits this category.
And so I've attached category pages to them, which still when you go to what links here doesn't show up.
But I felt if I've got it as a category page, it's as good as a link.
In these particular cases, these actually are kind of dead pages in a lot of cases, but in some cases are forwarding pages that go nowhere and stuff like that would probably get cleaned up.
But they not only have no references from other pages, they also are not attached to any categories other than orphan, which is why they show up.
So that's all kind of cleaned up now at this point.
I can see some of those probably just deserve to get linked into like essays as an essay page that lists them all.
Well, except that there's really nothing to them.
They're garbage?
Pretty much.
Not ready yet.
And then in this one, I'm not sure what Dan was doing with this one, but it was like a whole series of kind of strange.
Oh, I think I see what he was going to do.
Yeah, he was going to do he was going to set up a former shave like presentation and this is going to be his content.
He just never got around to finishing it.
Okay.
Well, I wasn't going to kill it.
I just that was.
Yeah, that's an example of the stuff that's on orphans.
And then the other thing that I mentioned this before to you, Raul, if I put in one of the index boxes across the top that you can jump around different places like labs had something like that.
A lot of the more complicated plot has that where it's got this blue stripe at the top and you can jump around in the categories.
If I put those into orphans because they had no parent page, then everything that was a descendant of those went into orphans.
So a number of them were cleaned up as soon as I took those out and said, those aren't orphans, they ended up putting them as well into wiki development, I think, because they're kind of an interesting thing to look at if you're trying to develop those kind of things.
But we're down to 37 pages and, you know, that's considered lower priority and those that's kind of expected if you land there, you're not going to think, oh, this is silly.
Yeah, no, I mean, you might think the page is silly, but you're not going to think that it's silly, it's categorized as an orphan.
Yeah, we were well over 400. So, you know, it didn't make sense that there would be that many pages that were, especially when you take into account the categories being sort of linked to them.
You know, the categories are going to have these pages as links at the bottom of their pages, which is why I, if I saw that they had categories attached, I took orphans out.
And so no longer orphans. But when you go to a category page, they do show up there so they do have links to them.
Then the curation will be how useful those are, but that's down the road.
And I will update this notice as well. I just finished up today so I'll say that I'll explain what I just explained in there so people know what the orphan, what an orphanage, an orphan is and why they're in the orphanage.
So, that's about all I got with that.
The other item I had was Jan Jacobs email that he sent to Ed and I about his work on trying to trim the tree, heuristically, I guess for lack of a better term, using word formation and numbers of occurrences and stuff.
I don't know whether you had a chance to look at that or not, Ed, his stuff he sent?
Yes.
Okay. What are your thoughts about it?
Well, one of his questions was whether the low-level categories corresponded to the lowest level of categories in the wiki hierarchy.
I'm not sure why that's an interesting question. Strikes me that the whole point is, can you come up with something, if not better, at least equally valid, shall we say.
So, the question of whether they correspond doesn't strike me as terribly interesting, but they don't correspond as nearly as I can tell.
Yeah, and I was guessing when he put that question is he probably is trying to see, I think it's good that they don't correspond because it shows a different way of looking at it.
Right.
But I think what he's trying to do is find out different ways, for lack of a better term, our semantics of how we categorized might match to the heuristics of how he categorized.
So, are there links there that we can start to draw on that say this group might be together? I don't know.
It seems like a lot of manual effort. I'm getting confused about that. I thought the whole point was to avoid manual effort.
Yeah, I mean, I wouldn't say. To me, the whole point was to see if there was a different view of the wiki.
Sure.
And it might involve manual effort. But you're right, if it's more manual effort than doing it completely manually, and it could just have a different personality.
Yeah, that's the question.
Yeah, yeah, exactly.
If it's also fabulously better, that might be one thing.
Well, we had our fingers crossed that it might be fabulously better. But at this point, I don't see that yet. But it's still interesting.
Yeah, and I'm happy to sit down with him and you and me again and talk it through. I don't have a problem investing that kind of time in it.
Yeah, and that's what I was going to suggest is that we could do the same thing and get a better sense of exactly what he's looking for.
It sounds like the J words and things like that you supplied him were helpful to him. So, yeah.
Yeah.
Well, it narrowed down his search, I suppose. Yeah, anyway, that's about all I had.
Okay. Raul, you and I had an exchange about x.y.u.v. and this sort of thing that I didn't entirely understand. If you wanted to peel back a layer of detail on that, I would appreciate it.
Okay. Let me back away from that and first start out that the thing I was focusing on behind that got me there was the examples of the various oddball grammatical constructs table.
And I know last week we were saying, "Is this really worth it?" Doing this string stuff. One of the ones I found there was a "while" statement that just seemed neat enough that it almost feels like it justifies the whole thing.
Which is the cap colon.
Yeah.
Cap colon, cap colon, underbar in parentheses, (^:^:_), is a while. So that's fine.
Yeah.
So the other thing that got me there is I was discovering that the – you already know this – that the SQLite query syntax doesn't really allow me to do searching for arbitrary collections of things in a sequence.
I can't look for a verb, verb, noun in the way we currently have things structured.
So that got me thinking about something that you had mentioned earlier, which is the whole concept of stemming, where you're searching for a word and when you search for this word it will automatically search for some related words.
That was the thing I was trying to understand. Now at the same time, coincidentally, about the same time I hit that, I stumbled on a page that looked interesting. I have to go dig it out again.
But it was the old u dot v dot type words. And it seemed to me that page would be more useful if we could upgrade it and replace the u dot with u, v dot with v, so it would work in a –
at one point you could use – there's a form you could use to say support the old syntax. But that's been ripped out. That doesn't even work anymore.
So at this point in time you could say that any documents that are using the old form are archaic and should be updated.
So I got to thinking, one, it would be nice if we could use your jviewer to search for these things, but that doesn't really work because when you search for u dot you still get searches for u.
Search for v dot you get searches for v.
That is fixable. I will just interject briefly.
And that's what I was thinking is it's useful perhaps when searching for u to find u dot because a beginner isn't going to know necessarily where that – those might still be interesting pages.
But we could probably live without that. But if we wanted to keep that and we also wanted to search for u dot, then the stemming process would let us have u search for the parent and u dot search for the more specific word.
The problem I suspect you may run into with stemming is that I think SQLite supports two, maybe three stemming algorithms. I'm not sure.
And they're black boxes as far as I know.
And I'm not sure there's a way – there may well be a way because SQLite is very flexible in this regard, but I don't know if there's a way to implement your own stemmer and tell the indexer to use it.
Which is I think what you would be thinking about doing.
So there's no stemming API that you're aware of?
Not that I know of. But that may reflect a failure of imagination on my part or something.
You still know more about it than I do.
Be careful.
If we add x dot, y dot, u dot, and v dot as J English glyphs, they'll get indexed accordingly.
So you'll be able to search for x dot, y dot, u dot, v dot, whatever.
Even though I tested it out and the word formation primitive keeps u dot together as a single glyph, a single word.
And that's all that's necessary to make it work.
So I could definitely add whatever you'd like me to add to the list of glyphs that get translated to J English and accordingly indexed and then are searchable.
Just let me know and I can do that.
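The idea of adding the old operand spellings as glyphs so the indexer keeps them whole can be sketched like this; the glyph list and tokenizer here are hypothetical stand-ins for illustration, not the J Viewer's actual code:

```python
import re

# Hypothetical glyph list: multi-character glyphs must come before any
# single-character alternatives so "u." is kept as one token, not "u" + "."
GLYPHS = ["u.", "v.", "x.", "y.", "m.", "n.", "^:"]
token_re = re.compile("|".join(re.escape(g) for g in GLYPHS) + r"|\w+|\S")

def tokenize(src):
    """Split J source text into searchable words, keeping glyphs whole."""
    return token_re.findall(src)

tokens = tokenize("u. ^: v.")
```

Because the alternation tries the glyph spellings first, "u." survives as a single indexable word, which is exactly what lets "u dot" and plain "u" become separately searchable terms.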
Well, I guess here's something to try out in maybe a very small context so you don't have to spend hours and hours waiting for the results.
Instead of having J54, J55, it was like JV54, JA55 with a second letter after the J indicating it was adverb or conjunction.
And then see if stemming would automatically recognize that.
Stemming that doesn't recognize an English-like word is not going to stem to anything, at least not the Porter Stemmer.
Okay.
That's not going to happen.
I'll just study the stemmers to see if there's anything possible leverage there, I guess.
Yeah.
You can do -- I'm pretty sure you can do string wildcard searches.
So it's possible you could do a capital J lowercase v star, for example.
I don't know what you lose when you start to do that.
It probably slows down pretty significantly, but you might not care about that.
And the only problem is that at least with the J viewer, we're not set up right now to support entering that kind of syntax.
The star would be treated as times.
Yeah, I have no problem constructing my own queries.
I just want the queries to do something besides return zero results.
Yes, right.
Exactly.
So hang on.
Let's see real quick here.
That was the most frustrating thing, is not that I got zero results, but that it wasn't telling me that -- it wasn't even giving a warning that that wasn't a valid query.
Yeah.
Such is life.
Oh, yeah.
Okay, yeah.
We're set.
So SQL has the LIKE operator.
And you just use the percent wildcard.
Yeah, percent wildcard.
Will that work with full text searches, though?
Well, that's my question is I don't know what you lose when you start doing wildcard searches.
I suspect, again, that things slow down pretty significantly.
But they would work.
They would work.
You would get non-zero results.
Yeah, they would work.
But in a full text context, there's a big difference between two symbols in a sequence on a page and two symbols in sequence next to each other.
Right, but percent gives you that, right?
You could search for JV percent.
That would be -- well, I don't know if percent works in full text.
But if it does work, I don't know if --
I doubt that it does.
I'm sorry.
I misunderstood your question.
If you're dubious about that, I am dubious about it, too.
Yeah, of course, there's also -- I noticed that the full text field, when I was in the browser, it has a bunch of little pieces.
And so maybe the full text search is constructed from that structure.
And if I understood it, I might be able to do something useful with it, build up my own queries somehow.
The other thing you could do is just load the whole corpus into memory and do regular expression searches.
You mean like the wiki corpus, for example?
Yeah.
The wiki corpus and also the GitHub corpus.
And I guess we already have regular expression support for the J code search.
So I should try doing a little more leveraging of that.
Build a big regular expression for that.
Good way to go.
That's not going to be as thorough as the JViewer, but at least --
Oh, I think the reverse might be true.
I think it might be more thorough than the JViewer.
You're not trying to work your way around the peculiarities of SQLite.
You're sort of going at it directly.
It would be more precise, but it wouldn't have the GitHub content, it wouldn't have the Rosetta Code content, it wouldn't have the forums.
No, I suggest that you should take all of that content.
No, I'm sorry, you're talking about using the search mechanism that Chris did.
No, I'm saying don't do that.
I'm saying I can deliver to you a database of plain text that is not J English.
Any subset of which you could load up into RAM and do Perl-compatible regular expressions against.
And I presume it has a corresponding delimiter and some way of finding your way back to the original documents?
Well, what you do --
I guess I can --
Think about how I'll put it together, but the notion would be you'd get a database that would have a link column, a title column, a text column, and maybe some other stuff.
And it might make sense just to do a -- if it's just all one document, I can search for text that's nearby if I wanted to locate the original documents.
No, it would be a database.
It would be a field for the link and a field for the title and a field for the text, the full text, the HTML, whatever.
And you would SQL read them into memory, whatever subset you wanted.
And once they were all nicely put to bed in memory, you could do Perl-compatible regular expression against them.
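That workflow can be sketched minimally like this; the schema and rows are invented stand-ins for the database Ed describes:

```python
import re
import sqlite3

# Invented schema matching the description: a link, a title, and the full text
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (link TEXT, title TEXT, body TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
    ("/wiki/While", "While", "the train ^:^:_ acts as a while loop"),
    ("/wiki/Plot", "Plot", "nothing about trains here"),
])

# Read whatever subset you want into memory...
pages = conn.execute("SELECT link, body FROM pages").fetchall()

# ...then run a Perl-compatible-style regex over it; the link column
# gives the way back to the original document
pattern = re.compile(r"\^:\^:_")
hits = [link for link, body in pages if pattern.search(body)]
```

Once the rows are in memory, any regex the re module supports works, with none of the tokenization constraints of the full-text index.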
Okay.
Sounds like it's worth the effort.
All right, I can do that.
One other thing I was going to mention, Raul, you were talking about the email from Marcin that triggered all of this.
And I thought -- and I think Ed said the same thing.
I don't remember seeing that.
When I looked back to try and find it for doing the notes, it was a conversation between you, Henry, and Marcin.
And it was in developers.
It was in your Google mail.
But I don't think it actually was on the forums.
There's two forums.
There's a dev forum and the other forum.
Okay, then it was on the dev forum.
And I guess I don't -- so have they kept the dev forum separate from the other forums?
It's a different Google group, I'm pretty sure.
I think it's --
Okay.
Because originally each forum had its own Google group.
And then Chris put them together.
But I guess maybe he left dev separate.
Yeah, it might have been the source forum became dev or something like that.
That's probably what it is, yeah.
I think the source forum was the only one that I wasn't subscribed to.
I just blanketed all the other ones for info.
But the source, I thought, I'm not going to be looking at that.
If I look at that, we're in much bigger trouble.
Yeah, it's not really the source forum.
It's like a bunch of development stuff.
But it's -- you don't have to -- I put the link in the chat for the Google group.
Okay.
Okay, well, I might grab that and put that into my Google groups.
Thank you.
I don't know, Ed, if you're indexing that group or not.
I'm not indexing anything on Google groups yet.
I have even ceased to admire the beauty of that problem.
It is so dispiriting.
I think that's probably by design on Google's part.
Yeah, I've looked.
Nobody knows how to crawl them.
I mean, we can crawl anything.
There are farms of servers internationally that will hit your --
under your control will hit your target site from many, many different directions at once
so that it doesn't look like it's being crawled.
And they will pretend that they have headless browsers and they run JavaScript
and they look like real people.
And they'll hit anything but not Google groups.
What about if you just put a dummy subscription in and then that would just send off,
like, everything else?
Like, it would -- the same as what I'm seeing.
I'm seeing all the postings.
Oh, I see.
Don't do it as a --
As a crawl.
Just do it as the stuff that's being fired at you.
Do it on an event-driven basis rather than going after a corpus.
That's an interesting thought.
Yeah.
Six months ago it would have been an even more interesting thought.
Well, I was thinking -- but, I mean, we can go back and reclaim those.
I mean, that's just six months, right?
That's just manual effort, yeah.
Yeah, yeah.
And it hasn't gone the way -- the amount of mail going through that has dropped
significantly since you switched it over.
Well, that's partly why I'm not so worried about it to be perfectly honest.
Well, yeah, I can see that.
But, you know, I think it should be done because what we're getting into now is a
lot of the discussion is going on with the betas, and that's useful information when
you're looking -- trying to look back and figure out what was going on.
And that wasn't happening -- like, the beta came up since that group was created.
So you don't have that information prior to it.
Yeah.
Hmm.
All right, well, I will certainly commit to feeling guilty about it.
But I'm not going to go beyond that.
That's not the point.
I understand, but I'm just letting you know.
I'm glad that they said that I didn't step up.
That really wasn't what I was aiming for.
I'm just trying to think of whether there's some way -- like, when I look at --
like, in my browser, I'm seeing all those groups.
I guess I have to open them up and copy them and do all that stuff to get the
content right.
Yep, yep, yep.
Oh, well, anyway, that's -- if you can think of a way that you can do it live
so that new ones come in, that might be the easiest way to address that.
Yeah, yeah, that's true.
So I see two to-dos out of this.
One, as I understand it, Bob, you're going to get in touch with Jan and get the
three of us together.
I'm going to make a database of plain text.
Well, probably mostly HTML for a row.
I think that's it.
Well, aside from looking at the forums for you, whether you think that's a viable
way to do it.
I have looked at them.
I have looked at them in some detail.
Yeah.
I'm already an Amazon Web Services person.
I'm an Azure person.
I am not looking forward to also having to become a Google Cloud Platform person.
Yeah.
Well, I can also -- I mean, Chris is the one administering it, so whether he can
produce some kind of a --
I looked into that, too.
I don't know that he can.
He can produce an archive, but it's not an archive that mere mortals can read from
what I've read.
Google doesn't make it easy.
They don't make it easy.
So -- and I'm guessing when they say an archive, it would be an archive that
Google can extract, right?
Precisely.
Yeah.
Which is -- the world of archives has changed recently.
I don't know whether you've heard about MTV and Paramount.
No.
Paramount has decided, since they're tens of billions of dollars in debt, they can no
longer afford to keep their servers running.
They had all the MTV news from the 1990s into the early 2000s.
And they've axed them.
Yeah, they just axed them.
They're gone.
All of the hip-hop history and stuff that was included in that, it's gone.
I have deeply mixed feelings about that, but okay.
Well, yeah, but I'm just saying as a cultural archive, I think it's --
Yeah, yeah.
And then last night I heard that Paramount has done the same thing with
Comedy Central.
Oh, wow.
They had Comedy Central shows back to the 1990s.
And they say you can still go on their site and see old shows as they've
chosen to keep them.
Right.
But the point was you used to be able to go back in the archives and see
everything they'd done.
Yeah.
They've ditched them.
They're done.
They're gone.
It's the equivalent of taking the boxes of tapes out to the junk and throwing
them away.
That is really peculiar because you would think it would be monetizable
somehow.
Well, I --
I mean --
Yeah.
I think it was probably a C-suite decision.
I'm sure it was.
And they don't -- they're not thinking about -- I mean, to me, it's not on
the level of war crimes, but culturally it's pretty close to that.
You don't go do that.
That's just -- that's terrible.
Yeah.
Wow.
So that's why when I hear archives and archiving, I'm thinking, yeah,
Google having the control to be able to do that and not having an API to do
it for other people.
Yep.
They just -- and Google's famous for deciding not to do something anymore.
Yeah.
And you just hooped.
Yep, you are.
Yeah.
Yeah.
So anyway, that's -- I think that's about all we got.
All right.
It was good to see you both.
Yeah.
I'll get to work, Raul.
I hope to have something for you by end of day tomorrow.
Thanks.
No rush.
I'm -- because I do have other avenues of approach here where I'm
gradually thinking of things to look for to dig up these obscure little
bits of grammar.
All right.
Well, let me know if there's anything I can do.
It definitely helps.
I'm not trying to discourage you.
I'm just saying it's not all on your shoulders or anything.
All right.
Good.
Cool.
Good to know.
And I'll get in touch with Jan to see what times might work for him.
Is there any particular times for you, Ed?
No.
I'm -- it may not matter, but on July 9th, I'm flying over to Ireland for
three weeks.
Oh, good for you.
I'm incommunicado.
I'm more or less in the same time zone as Jan at that point, so that
works as well for him as for me.
But, no, there aren't any times that are better than others.
So whatever works for you is probably okay for me.
Okay.
Well, I think it'll probably be sometime around midday from what I
remember for his time.
Yeah.
Perfect.
Okay.
I'll get back in touch with him.
And, yeah, I think that's about it.
So be well.
Keep in touch.
Take care, gentlemen.