Advanced search of Librivox site by keywords (exact match) yields anomalous result

TheBanjo
Posts: 1309
Joined: January 23rd, 2021, 8:19 pm
Location: Melbourne, Australia
Contact:

Post by TheBanjo »

I recently recorded a version of "The Red and the Black" for Librivox.

On the Internet Archive site, this audiobook is listed as having the following "Topics":
librivox
audiobooks
bildungsromans
france -- fiction
young men -- france -- fiction
fiction -- ambition
fiction -- church and state
fiction

These appear to me to be highly appropriate (though exactly how they have come to be associated with this audiobook, I am unsure — I know only that I, as soloist and project creator, did not supply them).

In order to find out whether these tags might have been added by Librivox (rather than by Internet Archive), I decided to try an "advanced search" of the Librivox site here: https://librivox.org/search?primary_key=0&search_category=author&search_page=1&search_form=get_results

I selected the "Exact match" checkbox, and in the Keywords field I typed (without the quotation marks) "fiction -- church and state". I left all other fields at their default values and executed the search.

The results did include my recording of "The Red and the Black". They also, however, included "The House at Pooh Corner" and a host of other books (in fact, close on 135 screens' worth) for which issues of church and state are of no relevance at all. Certainly the instance of our "Winnie the Pooh" audiobook at Internet Archive does NOT have the Topic "fiction -- church and state".

It would appear to me that there is something faulty about the way this Keyword search has been implemented on our website.

As a secondary issue, I would also be interested to know if anyone can shed any light on how these tags came to be associated with my recording of "The Red and the Black" at the Internet Archive site. Are they a mirror of keywords on the Librivox project database? If so, how would they have come to be entered there?
Last edited by TheBanjo on February 28th, 2024, 7:02 pm, edited 1 time in total.
TriciaG
LibriVox Admin Team
Posts: 60809
Joined: June 15th, 2008, 10:30 pm
Location: Toronto, ON (but Minnesotan to age 32)

Post by TriciaG »

Usually they're a mirror of what was entered here. But in this case, they aren't. Here's what is in the database:

fiction
France
bildungsromans
ambition
church and state
young men

I cannot see who changed them at Archive, but if I'm reading the horrible logs correctly, it looks like the ones you mention with the dashes are the original ones, and the ones in the catalog got changed. We don't have logs for who changes things in the database.

Personally, I think the dashes are unneeded in data for our keyword search.
School fiction: David Blaize
America Exploration: The First Four Voyages of Amerigo Vespucci
Serial novel: The Wandering Jew
Medieval England meets Civil War Americans: Centuries Apart
TriciaG
LibriVox Admin Team
Posts: 60809
Joined: June 15th, 2008, 10:30 pm
Location: Toronto, ON (but Minnesotan to age 32)

Post by TriciaG »

To add: I THINK the keyword search is an "or" search. So "church and state" would find records with "church", "state", or "and" in the keywords. I'm not 100% certain of this, though. Your searching for "fiction" probably brought up all the other results.
TriciaG
LibriVox Admin Team
Posts: 60809
Joined: June 15th, 2008, 10:30 pm
Location: Toronto, ON (but Minnesotan to age 32)

Post by TriciaG »

One last thing: The "Exact Match" is for the reader, not the keywords.
annise
LibriVox Admin Team
Posts: 38681
Joined: April 3rd, 2008, 3:55 am
Location: Melbourne,Australia

Post by annise »

Re the hundreds of irrelevant results: we are aware that there is an annoying display problem. For collections, the search displays every contribution, not just the one that matched, so it is very overzealous. I didn't know it matched dashes, but I suppose it does.

Anne
TheBanjo
Posts: 1309
Joined: January 23rd, 2021, 8:19 pm
Location: Melbourne, Australia
Contact:

Post by TheBanjo »

So in fact there is no way of searching for audiobooks that have the keyword phrase "fiction -- church and state"? Doesn't it seem a bit odd that a search for that key term is going to return a list of EVERY work tagged as a work of fiction in our collection?
Last edited by TheBanjo on February 28th, 2024, 7:28 pm, edited 1 time in total.
TheBanjo
Posts: 1309
Joined: January 23rd, 2021, 8:19 pm
Location: Melbourne, Australia
Contact:

Post by TheBanjo »

TriciaG wrote: February 28th, 2024, 6:45 pm One last thing: The "Exact Match" is for the reader, not the keywords.
Ha! That's an interesting "gotcha". What an odd field to have chosen for this unique privilege. I wonder why it's not available for other fields?

It does strike me that refining the design of this search facility should be a relatively straightforward process. At the end of the day, we're really just talking about flexibly building a SQL SELECT statement that meets users' reasonably foreseeable requirements. If we're going to go to the extent of allowing the association of audiobooks with key terms such as "fiction -- church and state", we ought to support searching by such key terms, given that this should be pretty easy to implement.
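As a rough illustration of the kind of SELECT statement meant here, the following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names (projects, keywords, project_keywords) and the sample rows are assumptions made up for the example; they are not taken from the real LibriVox schema or data. The whole key term is passed as a single bound parameter, so the dashes are just part of the string being compared.

```python
import sqlite3

def build_demo_db():
    """Create a small in-memory catalogue (hypothetical schema and data)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE projects (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE keywords (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE project_keywords (project_id INTEGER, keyword_id INTEGER);
        INSERT INTO projects VALUES (1, 'The Red and the Black'),
                                    (2, 'The House at Pooh Corner');
        INSERT INTO keywords VALUES (1, 'fiction'),
                                    (2, 'fiction -- church and state');
        INSERT INTO project_keywords VALUES (1, 1), (1, 2), (2, 1);
    """)
    return db

def exact_keyword_search(db, phrase):
    """Return titles tagged with exactly this key term (treated as one string)."""
    rows = db.execute(
        """
        SELECT p.title
        FROM projects p
        JOIN project_keywords pk ON pk.project_id = p.id
        JOIN keywords k ON k.id = pk.keyword_id
        WHERE k.value = ?          -- bound parameter, never string-concatenated
        ORDER BY p.title
        """,
        (phrase,),
    ).fetchall()
    return [title for (title,) in rows]

db = build_demo_db()
print(exact_keyword_search(db, "fiction -- church and state"))
```

Binding the phrase as a parameter rather than pasting it into the SQL text also keeps user input out of the query string itself.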
TheBanjo
Posts: 1309
Joined: January 23rd, 2021, 8:19 pm
Location: Melbourne, Australia
Contact:

Post by TheBanjo »

TriciaG wrote: February 28th, 2024, 6:42 pm To add: I THINK the keyword search is an "or" search. So "church and state" would find records with "church", "state", or "and" in the keywords. I'm not 100% certain of this, though. Your searching for "fiction" probably brought up all the other results.
Just to be clear about this, "fiction -- church and state" is a single key term, not multiple key terms. The dashes are part of the key term.

It would appear that you are quite right about this having been implemented as an OR search, though: our search facility appears to have pulled up all results whose key terms include "fiction" OR "church" OR "state". And possibly even "OR --", though I haven't checked that.
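The difference between the two behaviours can be sketched in a few lines of Python. This assumes the search string is split on whitespace and that a book matches when any one token equals one of its keywords; both the splitting rule and the sample catalogue are assumptions for illustration only, not confirmed details of how the site works.

```python
# Hypothetical catalogue: title -> set of keywords (invented sample data).
CATALOGUE = {
    "The Red and the Black": {"fiction", "church and state", "France"},
    "The House at Pooh Corner": {"fiction", "children"},
    "The First Four Voyages of Amerigo Vespucci": {"exploration"},
}

def or_search(query):
    """Match a title if ANY whitespace-separated token is one of its keywords."""
    tokens = set(query.split())
    return sorted(t for t, kws in CATALOGUE.items() if tokens & kws)

def phrase_search(query):
    """Match a title only if the WHOLE query string is one of its keywords."""
    return sorted(t for t, kws in CATALOGUE.items() if query in kws)

print(or_search("fiction -- church and state"))
print(phrase_search("church and state"))
```

Under these assumptions, the OR-style search returns every title carrying the keyword "fiction", while the phrase search returns only the title whose keyword is the full string.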
Last edited by TheBanjo on February 28th, 2024, 9:29 pm, edited 1 time in total.
TheBanjo
Posts: 1309
Joined: January 23rd, 2021, 8:19 pm
Location: Melbourne, Australia
Contact:

Post by TheBanjo »

OK, I've had a little more time to think about this.

First off, thank you for the helpful and informative replies I've received so quickly.

Second, I'd like to make clear that I'm not particularly troubled, for myself, about any issue related to key term searching in Librivox, as it's not a search technique I've ever felt impelled to use. I began to look into it a little more closely only because of the separate discussion that TedL has initiated at the moment on this forum about issues related to key term searching.

It's now clear to me that there is no direct or necessary connection at all between the (visible) key terms that appear on the Internet Archive site for a Librivox audiobook, and the (entirely invisible, to the end user) key terms associated with that same audiobook at librivox.org.

Furthermore, it would appear from TriciaG's comments on the use of hyphens in key terms (not especially in favour), and from the actual design of our key term search facility (implemented as an OR search of all terms entered in the relevant field) that the Librivox.org community is not particularly eager to adopt the Library of Congress style of keywording used at Internet Archive for our audiobooks, where the use of paired hyphens is a key feature. For myself, I don't have any particularly strong feelings about this one way or another, as I doubt many users are in fact trying to use keyword searches at Librivox.org anyway. (If they are, they certainly won't be getting far, poor souls!) Searching our collection using Library of Congress type key terms (including hyphens) at Internet Archive works just fine (see, for example: https://archive.org/search?query=mediatype%3A%28audio%29%20AND%20subject%3A%28fiction%20--%20church%20and%20state%29 ) and anyone who wants to search our collection against such key terms should clearly conduct their search there.
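For anyone who wants to build such an Internet Archive search link for a different key term, here is a small sketch that reproduces the URL above with Python's standard library. The helper name archive_subject_search_url is made up for this example, and the query layout is simply copied from the link in this post rather than from any Internet Archive documentation.

```python
from urllib.parse import quote

def archive_subject_search_url(subject, mediatype="audio"):
    """Build an archive.org search URL for one subject term (sketch)."""
    query = f"mediatype:({mediatype}) AND subject:({subject})"
    # safe="" percent-encodes ':', '(' and ')' as well as spaces.
    return "https://archive.org/search?query=" + quote(query, safe="")

print(archive_subject_search_url("fiction -- church and state"))
```

Hyphens are left as they are by quote(), so the paired dashes in the key term survive in the URL unchanged.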

Given that the de facto Librivox.org philosophy is NOT to adopt LOC type keywording, I can see there is little point in refining the Librivox.org search facility to allow for an exact match against an LOC-style key term.
TedL
Posts: 570
Joined: October 24th, 2022, 3:06 am
Location: Wisconsin
Contact:

Post by TedL »

Librivox.org is a WordPress site. WordPress site managers use search plugins. There are many good ones available that are far better than this at searching, and they can be installed and set up in half an hour.
Availle
LibriVox Admin Team
Posts: 22451
Joined: August 1st, 2009, 11:30 pm
Contact:

Post by Availle »

EXCEPT that our search function doesn't search the individual catalog pages on WordPress (aka our website) but rather the database we have behind the whole thing, which is entirely separate from the website. So... no WordPress plugin doing anything here I'm afraid.
Cheers, Ava.
Resident witch of LibriVox, channelling
Granny Weatherwax: "I ain't Nice."

--
AvailleAudio.com
TriciaG
LibriVox Admin Team
Posts: 60809
Joined: June 15th, 2008, 10:30 pm
Location: Toronto, ON (but Minnesotan to age 32)

Post by TriciaG »

Yes, as I recall, the search functionality was one of the last items on our To Do list for the workflow redevelopment in 2013. We ran out of time and money, so it was left undeveloped (or very basically developed).

All our software development since has been done by unpaid volunteers. So it has been left on the tree while easier, lower-hanging fruit has been worked on, depending on the volunteer's time, ability, and inclination. :)
TedL
Posts: 570
Joined: October 24th, 2022, 3:06 am
Location: Wisconsin
Contact:

Post by TedL »

TriciaG wrote: February 29th, 2024, 6:30 am Yes, as I recall, the search functionality was one of the last items on our To Do list for the workflow redevelopment in 2013. We ran out of time and money, so it was left undeveloped (or very basically developed).

All our software development since has been done by unpaid volunteers. So it has been left on the tree while easier, lower-hanging fruit has been worked on, depending on the volunteer's time, ability, and inclination. :)
2013 was 11 years ago. Don't you mean 2023?
redrun
LibriVox Admin Team
Posts: 2941
Joined: August 11th, 2022, 8:32 pm
Contact:

Post by redrun »

TedL wrote: February 29th, 2024, 7:15 am 2013 was 11 years ago. Don't you mean 2023?
Not a typo, and not a joke. I believe Tricia mentioned in the other thread that our code is on GitHub. That code rewrite came with a list of extremely helpful automations, and enabled some new features we've enjoyed ever since, but some things are still sub-optimal.
When I was first pointed to it (not quite "last year" anymore), I had just enough background knowledge to give a theoretical definition of an "M-V-C framework". Now I know some more about our code in particular, with enough background to not touch user input and SQL queries at the same time.

Like wiring a house for electricity: it's done safely many times a day, but you should either know exactly what you're doing, or operate under the close supervision of someone who does. And then test it, to be sure. :wink:
I'll be out for a bit on this last weekend of April, but still checking in as I get the chance. I will try to follow up on Monday, with anything I can't do on the go.
TheBanjo
Posts: 1309
Joined: January 23rd, 2021, 8:19 pm
Location: Melbourne, Australia
Contact:

Post by TheBanjo »

Fascinating. I had no idea the code was public - nor what a complicated beast it is. Not really surprising though, when I think about it.

Looks like the crucial code in this case is at https://github.com/LibriVox/librivox-catalog/blob/master/application/libraries/Librivox_search.php

I believe the code that is resulting in OR-like behaviour for keywords is this:

Code:

foreach ($keywords as $keyword)
    $escaped_keywords[] = $this->db->escape($keyword);
$in_keywords = implode(", ", $escaped_keywords);

$keyword_clause  = ' JOIN project_keywords pk ON (pk.project_id = p.id) ';
$keyword_clause .= ' JOIN keywords k ON (k.id = pk.keyword_id) AND k.value IN ('. $in_keywords .')  ';
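where I believe the IN (...) list in the second JOIN condition matches a project as soon as any one of the listed keywords is attached to it, which produces the OR-like result.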
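To check that reading, here is a minimal sketch that runs a query of the same shape against an in-memory SQLite database from Python. The three table names follow the snippet above, but the columns beyond those shown, the sample rows, and the idea that "church" and "state" exist as separate keyword values are all assumptions for the demonstration. The optional HAVING clause is one possible way to get AND-like behaviour; it is not something taken from the LibriVox code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE project_keywords (project_id INTEGER, keyword_id INTEGER);
    INSERT INTO projects VALUES (1, 'The Red and the Black'),
                                (2, 'The House at Pooh Corner');
    INSERT INTO keywords VALUES (1, 'fiction'), (2, 'church'), (3, 'state');
    INSERT INTO project_keywords VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")

def search(terms, require_all=False):
    """Titles matching ANY term (the IN behaviour) or, optionally, ALL terms."""
    marks = ", ".join("?" for _ in terms)   # one bound placeholder per term
    sql = f"""
        SELECT p.title
        FROM projects p
        JOIN project_keywords pk ON pk.project_id = p.id
        JOIN keywords k ON k.id = pk.keyword_id AND k.value IN ({marks})
        GROUP BY p.id, p.title
    """
    params = list(terms)
    if require_all:
        # AND-like variant: every distinct term must be attached to the project.
        sql += " HAVING COUNT(DISTINCT k.value) = ?"
        params.append(len(set(terms)))
    sql += " ORDER BY p.title"
    return [title for (title,) in db.execute(sql, params)]

print(search(["fiction", "church", "state"]))
print(search(["fiction", "church", "state"], require_all=True))
```

With this sample data, the plain IN query returns both titles because each has at least one of the listed keywords, while the HAVING variant keeps only the title that carries all three.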

At face value, it appears that Librivox_search.php may have last been touched a couple of months ago.