Report of Meeting 2024-05-23
Present: Ed Gottsman and Bob Therriault
Full transcripts of this meeting are now available below on this wiki page.
1) Bob and Ed reviewed the SVG that Jan had produced for the categories. The hover information is available in the Safari browser, but not in the Chrome browser. There was a discussion of the meaning of the hover information, which may require a follow-up meeting with Jan to clarify. Jan pointed out in an email that eliminating certain words may have unintended consequences for the organization of the wiki. Low-interest words such as prepositions would have high occurrence counts, but would confuse the identification of higher-interest words, which may not occur as often in the text. Bob referred to a counting algorithm that had been publicized in Quanta Magazine that allowed accurate word frequencies without having to hold the whole corpus in memory at one time. https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/ Ed pointed out that the corpus of the Wiki is not that large. Ed also said he had thought that Jan's technique would find synonyms, although the questions that were asked about stemming may indicate that this is not possible. Ed had expected that there would be an outline, and Bob thought that it may be in the CSV documents that Jan had provided. The categories show up as the leaf nodes, broken into different areas. This will require some explanation in the meeting with Jan. There was also a full list of the individual words that had been used, and Jan may be looking for guidance on which words are less important. This would be clarified in the meeting.
2) Bob explained that it was not possible to move the template for the Category map to the bottom of the page using CSS. This might have to be done using PHP by inserting the template into the footer of the page, which is not accessible through the content areas. Another option is to have Category pages lead with the Category map, with the attached pages following below. https://code.jsoftware.com/wiki/Category:Help/Index_W.4 This differentiates Category pages from Content pages https://code.jsoftware.com/wiki/Help and, as long as the current Category is highlighted on the Category Tree Map, it is useful information for the user.
3) Ed reviewed the issues that he is facing with the J Wiki Viewer. At the moment the debugging process of adding log statements is correcting the problem. This appears to be a problem only with J9.6 beta 7 and so perhaps Henry can be helpful with what may have been changed. The J Wiki Viewer may be an edge case that is useful.
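A note on the counting algorithm mentioned in item 1: as far as can be told from the Quanta article, the technique (usually called the CVM algorithm) estimates how many distinct words a text contains, rather than a frequency for each word. The Python sketch below is a minimal illustration under stated assumptions. The function name `estimate_distinct`, the `capacity` parameter, and the choice to repeat the thinning round while the buffer is still full are all assumptions made for this example; none of it is taken from Jan's work.

```python
import random


def estimate_distinct(stream, capacity=100, rng=None):
    """Estimate the number of distinct items in `stream` using bounded memory.

    Illustrative sketch of the CVM idea; `capacity` is the buffer size
    (the "100 spaces" in the transcript) and should be at least 2.
    """
    rng = rng or random.Random()
    p = 1.0          # current probability that a seen item is kept
    kept = set()     # the bounded buffer of remembered items

    for item in stream:
        # Forget any earlier decision about this item, then re-decide:
        # only its most recent occurrence matters.
        kept.discard(item)
        if rng.random() < p:
            kept.add(item)

        # When the buffer fills, flip a fair coin for every stored item
        # and halve p, so later items need one more "heads" to stay in.
        # Assumption: repeat the round in the rare case nothing was removed.
        while len(kept) >= capacity:
            kept = {w for w in kept if rng.random() < 0.5}
            p /= 2.0

    # Each distinct item survives with probability p, so scale back up.
    return len(kept) / p


if __name__ == "__main__":
    words = "to be or not to be that is the question".split()
    print(estimate_distinct(words, capacity=100))
```

When the text has fewer distinct words than `capacity`, nothing is ever thinned and the result is exact. For a corpus the size of the wiki, which fits in memory, an ordinary exact count is simpler, which is the point Ed made.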
For access to previous meeting reports, see https://code.jsoftware.com/wiki/Wiki_Development If you would like to participate in the development of the J wiki, please contact us on the J forum and we will get you an invitation to the next J wiki meeting, held on Thursdays at 23:00 (UTC).
Transcript
There we go.
I should remember my agenda.
I think we were going to start off with what Jan had done about his mapping.
Did you get a chance to take a look at that?
I did not look at the map that he finally produced.
I looked at his requests of us, and I'm not quite sure what to do, to be honest.
Okay, well, let's take a look and see.
I will pull up his map.
Share screen, and you can take a look and see if there's anything.
And share.
So this is his map.
And when you hover on a box, like one of these, basically that's telling you, and it's indexed
from the lower left corner.
So this is the 46th row and the 11th column.
And he's given it number 90.
And then this is 12 and that's 90 as well.
So I'm not exactly sure what the numbers within the boxes refer to, but I think he calls them
labels or something, identifiers.
But the 4612 is essentially the mapping of the blocks.
And then if you actually hover over one of the little dots, you get this, which the 2974,
I'm guessing is an identifier.
And then there's the series of categories that it fits to.
And then I'm guessing the dot 76 is some quantity because they go from the high down to the
low, they're ordered that way.
And the only thing is, is that the dot dot dot at the end means that you can't actually
see what the lowest categories are.
Right.
Because it starts left to right.
But it looks like it's some sort of ordered relevance ranking.
Yeah.
And it corresponds to, I can't, I'm challenged with it.
I can't actually put my cursor over it, but I think the 0.76 is our library.
Yeah.
Or it may be stats or stats may be 0.52.
But so let's see, that's one, two, three, four, five, six, seven, eight.
Yeah, that takes us down to all but the last two, unless it starts with stats, in which
case it's all but the last one is shown.
Right.
Yeah.
And then he's got the red lines between the different provinces and you can make it bigger
to a point.
Or is this as big as you can make it?
Oh, maybe that's as big as you can make it.
But I think his concern was that when you get into this area in here, it's so highly
populated, it's not really telling you very much.
Right.
And then his questions, one of them, which I kind of responded to, I think he was wondering
if there were categories or words that he could take out.
And I mentioned J6 because I know J6 is a whole category of the old version of J that
I know Eric is okay with in terms of archiving, but really doesn't want it front and center.
He doesn't really want it front and center.
And Jan pointed out in his email back to me that it might have, J6 might actually refer
to something else, which twigged my awareness as well as I'm not dealing with categories.
I'm dealing with actual words.
But I have the feeling J6 is probably a safe category to take out, but possibly the tensor
experiments that I think was it Thomas has done, those would create problems because
you can't just take the word tensor out.
That would be an issue.
Yeah.
It would affect a lot of different things.
Right.
Yeah.
We have no, to my knowledge, no way of automatically identifying less interesting words.
That's not on.
Yeah.
And possibly, I'm not sure, but possibly when you hover over these, the frequency numbers
here might give you a sense of what the, in fact, this one here, he might give us a clue
to how he's counting it.
It looks like report is 0.66, meeting is 0.62 and wiki development is 0.22 because there's
no dot dot at the end.
So that is how those are being broken up.
Right.
But apparently those measurements are not what he's looking for in terms of being able
to identify low interest words.
There's something else.
Otherwise he would just use them.
Yeah.
I think they probably reflect back on the categories.
Yeah.
But then it gets a little bit tricky because I can identify categories, but he can't respond
to categories.
But the information he's pulling out on a word level doesn't seem to be helping him.
So it is a bit of a funny thing.
But it's, I mean, he would like to have fewer irrelevant words in the corpus, but his whole
point was that he would be able to identify the most relevant words in the corpus.
So I'm at something of a loss, to be honest.
Well, and actually, as you say that, there was a recent article in Quanta Magazine about
the new way of counting frequencies.
Did you have a chance?
Had you seen that?
I did not have the pleasure.
Really interesting.
It gives you an estimate of how many times words occur in a large body of text, like
a really large body of text.
And you think, OK, well, so the easy way to do it is you count all the words and you do
a frequency and that'll tell you.
But in order to do that, at some point, you have to store all the words in the text.
And the way this works, it's really quite interesting, is you have 100 spaces.
And every time you see a new word, you put it into a space.
And when that 100 spaces is full, what you do, I think this is how it works, you basically
flip a coin 50/50.
You flip a coin 50/50 the next time a word shows up.
And I've probably got it all scrambled up.
But essentially what it is, it's a binary frequency that as you go through, you only
have to store 150 words.
Because as the words fill up, when it gets to the point where it's completely full, you
start over again with the top 100 words and flip a coin to see whether they stay in or
they don't.
So a word that had shown up 70 times may get flipped and disappear.
But apparently by the time you get to the end of it, you go through the whole script.
If it's a common word, it's going to show up again.
And whenever you get to the maximum, you go back and you knock it down and you go to an
extra flip of a coin.
So at the end of it, you might be doing eight flips in a row and it has to get nine flips
in a row to stay in or it goes out.
But the point is that by the end of it, the most frequent words are going to show up.
And it'll give you a word count and the frequencies of the whole thing.
And it just does it, it's almost like a window that passes through the whole text.
And at the end can tell you the top 100 words and their frequencies.
That would be of interest for very large purposes, right?
I mean, it doesn't really have any relevance here.
We can fit the whole thing in memory.
We can.
The example they gave though was the text of Hamlet, which isn't super large.
Then I'm missing something.
I don't know why it would be a problem.
Well, they used that as an example because it was really accurate.
It was within 15 words.
I see.
But you don't need it for Hamlet.
You don't need it for Hamlet.
You just have all the words.
But it's that accurate.
Yeah, yeah.
Right.
That's interesting.
But I mean, and Jan doesn't need it here because this corpus just isn't that big.
I guess not.
So you're saying you can't just do it on a word frequency?
He can fit the whole corpus in RAM and do whatever counting he wants to do.
But by the time he does his calculations with it, isn't that what makes it really large?
He needs to reduce that number of words.
He's not just counting frequencies.
He's doing much more interesting stuff, stuff I do not understand.
But counting the number of times a word occurs is not a problem that he has with a corpus
this small.
Okay.
So what he's asking us is whether there are words that he can take out.
Yes, but I don't think it's for technical reasons, that is, that he doesn't have enough
memory.
I think it's because he's getting a lot of noise.
Okay.
But could we not do it on frequency of the words that go in?
The most frequently occurring words are not necessarily the best words for categorization.
Yeah, that's true.
You're getting a lot of the articles and the prepositions and stuff.
Oh, yeah.
Oh, gosh, yes.
Yeah, yeah.
I suppose if you eliminated those, though.
Well, yeah, you take out stop words.
And he was looking, he also asked for essentially ways to find synonyms, which again, I would
think would be something that would fall out of his efforts, but apparently not.
And he mentioned stemming, which only works for tenses and plurals and singulars.
That's not going to find synonyms.
So yeah, I'm at something of a loss at this point, I'm afraid.
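The point about stop words and stemming can be illustrated with a small Python sketch. Everything in it is hypothetical: the stop-word list is deliberately tiny, `naive_stem` is a toy suffix stripper rather than a real stemmer, and neither reflects what Jan actually uses. The word "at" is left out of the stop-word list because, as comes up later in this conversation, it is also part of J's vocabulary.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list. "at" is intentionally
# absent: it is an English preposition but also a J term.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "on"}


def naive_stem(word):
    """Strip a few common English suffixes (toy example, not a real stemmer)."""
    for suffix in ("ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_counts(text, stop_words=STOP_WORDS):
    """Count stemmed words in `text`, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(w) for w in words if w not in stop_words)


if __name__ == "__main__":
    sample = "The verb and the verbs in the dictionary of functions"
    for term, count in term_counts(sample).most_common():
        print(term, count)
```

In the sample sentence, "verb" and "verbs" are merged into one term, but "verb" and "function" remain separate entries even though a J reader would treat them as closely related. Stemming only folds inflected forms together; finding synonyms needs some other source of information.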
Okay.
Well, I guess what we could always do is see if we can set up another time to have a call
with him.
Sure.
And see whether there's something that we're missing that we might be able to do for him,
or maybe there's work that we can do on our end that helps him out.
Yeah.
It's kind of cool to see the map, but it is also sort of not clear to me right now how
it's breaking out the information.
I am a little confused because I thought we left it with him that he was going to produce
an outline for you that is a set of, that is a proposed category hierarchy, independent
of the map user experience that might one day be applied to it.
Well there were some other CSV documents that he had there.
Yeah, I didn't look at them.
We should have a look.
Yeah, I was going to say, I've pulled them down so I can take a look at them.
Just wondering, trying to remember where I put them.
I will look and see where I put them.
Way too many folders.
There we go.
Okay.
So let's see if I can now share.
We'll take a look at these guys.
Make that a bit bigger.
Hopefully.
There we go.
Okay, I have to do it through view.
Of course, Apple in their wisdom have used a different.
There we go.
Bottom category names.
Yeah, so these are the bottom category names.
I'm guessing that they seem to be in order.
Yeah, so this is just the leaf nodes.
Yeah.
I would guess.
Oh, no, maybe not.
So he's broken it into different areas here.
Yeah.
So there's six here, and then it starts up again.
Of course, 395 with New York City J user meetings, which doesn't surprise me.
Yeah, it'll be big.
So I'll take more of a look at that and see whether that gives an indication.
Yeah, it's huge.
Massive.
Yeah.
So I guess these again, back to the map, this would be the coordinates.
And then I guess the size.
That's what this is.
That's what that number means is the size of it.
And population, I'm guessing, is the number that's in here.
Let's see if we get down to.
Yeah, population will be these six.
Yeah.
So I may have to go back to him and ask him kind of what I'm looking for here and the
name one and the name two and how that reflects on it.
Yeah.
And maybe you have to stop sharing to do this properly.
And then all terms.
I'll share that one.
This was more just.
This is almost just like an Excel table.
In alphabetical order of all the words.
Does look that way, doesn't it?
Yeah.
Well, in fact, I was looking through and I found Therriault in there.
I found Gottsman in there.
And so it's like if the word shows up, it's in this table.
So that's just essentially all the terms that are available.
That's what he's using.
And so I guess what he's asking is whether there's specific words we could take out of
this that aren't required.
Sure.
But he can identify them as easily as we can.
I, well, some of them he can.
And I guess some of them he's wondering whether they are particularly we think they were useful
if you're making a mistake to take them out.
Like for instance, the something as simple as at is a preposition.
Is it a preposition?
Yeah, I think so.
But it's also a word that's used at the top.
It's the description of conjunctions.
So you really probably couldn't take at out.
Right.
But there aren't a lot like that.
And again, I mean, he's got some J knowledge.
He can probably figure it out.
Yeah.
I'm not about to go through this term list and say, no, no, no, no, you can't take that
out.
No.
Yeah, it would almost honestly, it would almost be a matter of having somebody go through
and just take out stuff that they thought and then see what the result is.
Right.
Almost tuning it.
For that level of effort.
Yeah.
You could produce a category hierarchy.
In fact, you have done so.
Yeah.
Anyway, those are the documents that he sent out with us.
So, yeah.
So I think maybe setting up another conversation with them would make sense.
Yeah.
I think you're probably right.
For my templates and stuff.
The only the downside I found out is I was wondering whether I might be able to do it
with CSS and ::after.
But that specifically does not work.
Because then you're playing with the structure of the DOM.
And what it will allow you to do is it will allow you to insert content after something
in a paragraph, say.
But it won't let you take one part of the DOM structure and move it to another part.
Which is really what I'm looking to do.
Insert you cannot move.
You can't move it.
I see.
Okay.
And so what ideally what I would want to do is move it to the bottom of the page, maybe
put it in the footer.
And I was actually there's two ways I can see to do that.
One is I'm pretty sure that Raul is able to do that with PHP.
Because he was able to put that up in the upper right corner, the map and the search
and all that stuff.
And that's just done by an insertion into that part of the script.
He could do the same thing, I think, and put it into the footer.
And we just put it on every page in the footer, which is actually kind of what we want.
Pretty good solution.
Yeah.
I'll have to ask him about that and see whether that's a reasonable request.
The other way to do it is in the and I kind of thought this afterwards.
The other way to do it is not to worry about the fact that you're putting all the categories
on the top of the page when it's a category page.
As long as you highlight the category it is, and then it's followed by all the pages.
Because above that category list, you can always write something at the top that indicates
below this category list are the pages listed to it or a description of the category page
you're on, which is what I was originally thinking of doing.
And highlight within that map, which page you're on.
And then right underneath it would be all the pages.
It's going to do that automatically.
So that's the other way to do it is to say, that's what we do for category pages.
The downside of that is when you get to content pages, the category pages are going to be
below content pages.
In category pages, it'll lead with the categories and it'll have page links underneath.
Yeah.
I don't think that bothers me too much.
No, it actually distinguishes between category pages and content pages.
There is that.
Yeah.
And if the category list is always highlighting the category you're particularly on, I think
that's useful.
Yes, absolutely.
So you can see where you are and then below it, you could see all the pages attached to
it.
Yeah.
And that would be what you see.
Yeah.
And then you can navigate and you're not going anywhere.
It's just highlighting different categories.
Categories and then the pages at the bottom will change and the highlighted category will
change.
Yeah, exactly.
When you click to a particular page, then you're going to see the page and underneath
that category list will be again.
Yeah.
Right.
Yeah.
I think that's fine.
Yeah.
It's that.
That was anyway, the two solutions, one is to drop it to the bottom in the footer and
until I talk to him, maybe what I'll do is just see whether it's feasible to do what
I was proposing with category pages.
Category pages.
Yeah.
Okay.
Yeah.
Well, good.
Yeah, that sounds great.
And I guess the other thing is the J Wiki Viewer or the viewer.
Yeah.
It's a little frustrating.
It's having a problem.
It's having a problem.
And I added a bunch of trace to try to track down the problem and I got closer and closer
and I added enough trace that it started working for no good reason.
The problem evaporated.
And that's where I am now is that if I put in enough trace code, which is just logging
to a file, everything works.
So that's a problem.
My debug fixes the problem, I guess is what I'm saying.
You saw what I sent back to you, right?
The log?
Yeah.
I have a much more detailed log because I put in a lot of additional log statements and
something's happening while it's trying to, I think something is happening while it's
trying to move the old, no, while it's trying, basically while it's trying to write the new
database file out.
Yeah.
But I got the log statements right up to that.
And as soon as I added a couple of additional log statements, the last couple that I added,
everything started to work just fine.
So that's where I am now.
And is it only J9.6 that that's a problem for?
J version 9.6.
Yeah.
I haven't gotten any reports that it's failed anywhere else.
Okay.
It might be worth an email to Henry.
Oh yeah.
Well, cause it's different, right?
It's working in the other places.
It's just not working.
Oh yeah, absolutely.
Yeah.
So yeah, I would like to have something a little more solid to say to him.
I haven't given up yet.
It's just where I happen to be at the moment.
Okay.
If somebody is having trouble downloading it or using it, the answer is use 9.5 and
you'll be fine.
Okay.
Cause I recorded with Connor two days ago, but we didn't get a chance to talk about it,
but that's what I can tell him.
Just use 9.5.
9.6 is a beta.
9.6 beta seven is a problem.
The earlier betas, apparently there's no problem.
I never got any reports on those.
Well again, you chase it down if you can, cause that's useful, but you might let Henry
know because he'll know what he changed in that space.
And that might be really significant to him.
Cause a lot of times when you change something, you're kind of going fingers crossed.
I wonder what's going to happen.
Oh, nothing happened.
I guess everything.
Right.
And then you don't hear anything.
You don't hear anything.
And then somebody too late.
Yeah.
And then you hear something and says, well, what about this?
And you go, Oh, fascinating edge case.
Foundational.
Let's, let's undo that.
Quite quite.
Yeah.
Quite.
Yeah.
And I know cause it's a beta.
He's got no problem with hearing feedback on it.
He welcomes that.
Yeah.
It's not like you're just creating noise.
Yeah.
All right.
Anyway, that's about, that's about all I got.
All right.
That's all I've got too.
Unfortunately.
I'm going to go back to the chat.
Marvelous, Bob, thanks so much, take care.
You as well, bye bye.
Goodbye.