[FIXED] Japanese Fairy Tales: Typo in section 10 title

Posted: December 30th, 2020, 1:51 am
by dekymo
In the audiobook "Japanese Fairy Tales" by Yei Theodora OZAKI (1871 - 1932)
https://librivox.org/japanese-fairy-tales-by-yei-theodora-ozaki/
there is a typo in the title of section 10: "The Mirror of Maysuyama"
should be "The Mirror of Matsuyama".

The MP3 file
https://www.archive.org/download/jap_fairytales_0808_librivox/japanesefairytales_10_ozaki.mp3
has an ID3v1.1 tag (at end of file) and an ID3v2.3 tag (at beginning of file).
The title error is present in both.
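
For anyone who wants to check a file themselves, here is a minimal Python sketch of how both tag types can be detected from the raw bytes. The function name `detect_id3_tags` is made up for illustration; it only looks at the "ID3" marker at the start and the "TAG" marker 128 bytes before the end, and does not validate the tag contents.

```python
def detect_id3_tags(data: bytes) -> dict:
    """Report which ID3 tag types appear in the raw bytes of an MP3 file."""
    found = {"id3v1": False, "id3v2": None}
    # ID3v2: a 10-byte header at the very start, beginning with b"ID3".
    if len(data) >= 10 and data[:3] == b"ID3":
        # Bytes 3 and 4 hold the major version and revision, e.g. (3, 0) for v2.3.
        found["id3v2"] = (data[3], data[4])
    # ID3v1 / v1.1: a fixed 128-byte block at the very end, beginning with b"TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        found["id3v1"] = True
    return found
```

To use it, read the downloaded file in binary mode, e.g. `detect_id3_tags(open("japanesefairytales_10_ozaki.mp3", "rb").read())`.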

Additionally, in the v2.3 tag, the artist & album tags are corrupted.
The corruption is visible in the preview/playlist window on the Internet Archive page, where for section 10/track 11 instead of the author (artist) name you see random Chinese characters.

Re: Japanese Fairy Tales: Typo in section 10 title

Posted: December 30th, 2020, 4:46 pm
by knotyouraveragejo
Hi dekymo

Thanks for letting us know. I've fixed the typo on the catalog page. The problem with the ID3 tags is a little more complicated to fix. Will post when it is fixed. This is quite an old project cataloged in 2008. Was this the only file affected?

Re: Japanese Fairy Tales: Typo in section 10 title

Posted: December 31st, 2020, 9:10 am
by dekymo
knotyouraveragejo wrote: December 30th, 2020, 4:46 pm This is quite an old project cataloged in 2008. Was this the only file affected?
The artist/album corruption is only present in section 10 of the audiobook "Japanese Fairy Tales", none of the other files are affected.

However, I checked sections by the same reader (Clarke Bell) in other audiobooks, and the same kind of ID3 tag corruption is present in some of their sections in other books:

section 84 ("The Villefort Family Vault") of The Count of Monte Cristo
https://librivox.org/the-count-of-monte-cristo-by-alexandre-dumas/

section 21 ("THE COUNTESS DE WINTER") of The Three Musketeers
https://librivox.org/the-three-musketeers-by-alexandre-dumas/

(section 37 in the same book, and the sections by Clarke Bell in other books look OK).

Re: Japanese Fairy Tales: Typo in section 10 title

Posted: December 31st, 2020, 10:56 am
by knotyouraveragejo
This was happening in the past for a while on some files, and if I remember correctly it had to do with what software was used to add the ID3 tags to the original files. If you download the file and check the tags in Windows, they look fine, as does the archive.org metadata. I would have to download the files, replace the tags and reupload. This is a little more work for these older projects, since archive has changed how they derive the files since then.

Re: Japanese Fairy Tales: Typo in section 10 title

Posted: December 31st, 2020, 2:41 pm
by knotyouraveragejo
The mp3 files are fixed for The Japanese Fairy Tales. The ogg file for section 10 is now missing, but I doubt if anyone will notice. Newer projects no longer have these files at all.

As for the others, I'll see about them sometime when I have time. The extra characters on the archive page do not affect streaming or downloading the files.

Re: Japanese Fairy Tales: Typo in section 10 title

Posted: December 31st, 2020, 3:23 pm
by dekymo
knotyouraveragejo wrote: December 31st, 2020, 2:41 pm The mp3 files are fixed for The Japanese Fairy Tales. [...]
As for the others, I'll see about them sometime when I have time. The extra characters on the archive page do not affect streaming or downloading the files.
Great! I agree with you that the ID3 tag issue isn't a problem in practice for most people.

I tried to investigate what the cause of the problem was; I include what I found out for the sake of completeness/for future reference, feel free to ignore this post.

I downloaded japanesefairytales_10_ozaki.mp3 (the broken version) and looked at the ID3v2.3 tag data at the start of the file in a hex editor:

00000000: 4944 3303 0000 0000 1131 5441 4c42 0000  ID3......1TALB..
00000010: 002f 0000 01ff fefe ff00 4a00 6100 7000  ./........J.a.p.
00000020: 6100 6e00 6500 7300 6500 2000 4600 6100  a.n.e.s.e. .F.a.
00000030: 6900 7200 7900 2000 5400 6100 6c00 6500  i.r.y. .T.a.l.e.
00000040: 7300 0054 5045 3100 0000 2b00 0001 fffe  s..TPE1...+.....
00000050: feff 0059 0065 0069 0020 0054 0068 0065  ...Y.e.i. .T.h.e
00000060: 006f 0064 006f 0072 0061 0020 004f 007a  .o.d.o.r.a. .O.z
00000070: 0061 006b 0069 0000 5452 434b 0000 0006  .a.k.i..TRCK....
00000080: 0000 0031 312f 3232 5443 4f4e 0000 0006  ...11/22TCON....
00000090: 0000 0028 3130 3129 5449 5432 0000 001d  ...(101)TIT2....
000000a0: 0000 0031 3020 2d20 5468 6520 4d69 7272  ...10 - The Mirr
000000b0: 6f72 206f 6620 4d61 7973 7579 616d 6100  or of Maysuyama.

You can make sense of this using the ID3v2.3 specification: https://id3.org/id3v2.3.0

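Starting towards the end of the first line:

54 41 4c 42 = "TALB", the frame ID for the album title
00 00 00 2f = the frame size (47 bytes)
00 00 = the frame flags
01 = the text encoding (Unicode with a byte order marker)
ff fe = a byte order marker indicating UTF-16 little-endian

Thus you would expect the contents of the album title to follow in UTF-16-LE.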
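
The same frame layout can be walked in code. This is only a rough Python sketch, not a full ID3 parser: `DUMP_HEX` is the hex dump above retyped, `iter_frames` is a made-up helper name, and it assumes an ID3v2.3 tag with no extended header, no unsynchronisation and no frame flags set (which matches the zero flag bytes in this dump).

```python
import struct

# The first 0xc0 bytes of the file, copied from the hex dump above.
DUMP_HEX = """
4944 3303 0000 0000 1131 5441 4c42 0000
002f 0000 01ff fefe ff00 4a00 6100 7000
6100 6e00 6500 7300 6500 2000 4600 6100
6900 7200 7900 2000 5400 6100 6c00 6500
7300 0054 5045 3100 0000 2b00 0001 fffe
feff 0059 0065 0069 0020 0054 0068 0065
006f 0064 006f 0072 0061 0020 004f 007a
0061 006b 0069 0000 5452 434b 0000 0006
0000 0031 312f 3232 5443 4f4e 0000 0006
0000 0028 3130 3129 5449 5432 0000 001d
0000 0031 3020 2d20 5468 6520 4d69 7272
6f72 206f 6620 4d61 7973 7579 616d 6100
"""


def iter_frames(tag: bytes):
    """Yield (frame_id, payload) for each complete ID3v2.3 frame in `tag`."""
    if tag[:3] != b"ID3" or tag[3] != 3:
        raise ValueError("not an ID3v2.3 tag")
    pos = 10  # skip the 10-byte tag header
    while pos + 10 <= len(tag):
        frame_id = tag[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":  # reached padding
            break
        # In v2.3 the frame size is a plain 32-bit big-endian integer.
        (size,) = struct.unpack(">I", tag[pos + 4:pos + 8])
        payload = tag[pos + 10:pos + 10 + size]  # the 2 flag bytes are skipped
        if len(payload) < size:  # frame runs past the bytes we have
            break
        yield frame_id.decode("ascii"), payload
        pos += 10 + size


if __name__ == "__main__":
    for frame_id, payload in iter_frames(bytes.fromhex(DUMP_HEX)):
        # Show the encoding byte and the first few bytes of the text.
        print(frame_id, payload[0], payload[1:5].hex())
```

For the TALB and TPE1 frames the bytes after the encoding byte start with ff fe fe ff, i.e. two byte order markers in a row.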

Instead you get what looks like another byte order marker (fe ff), which would indicate UTF-16 big-endian, followed by the title in big endian encoding (00 4a 00 61 00 70 ... = "Jap"...)

Thus the correct data is there, in UTF-16-BE, but it's preceded by a bogus UTF-16-LE byte order marker, which is causing the contents of the field to be misinterpreted.

Specifically, the first four bytes (fe ff 00 4a) are being interpreted as the little-endian code units U+FFFE and U+4A00, which corresponds to what I was seeing displayed in place of the album name: a glyph representing the BOM, followed by a string of random Chinese characters, starting with 䨀.
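
The misreading is easy to reproduce. Below is a small Python sketch using the start of the TALB text bytes from the dump; `repair_text` is a hypothetical helper for illustration only, and I am not claiming this is how archive.org or any particular tagger handles the field.

```python
import codecs

# Start of the TALB text, copied from the dump (after the 0x01 encoding byte).
raw = bytes.fromhex(
    "fffe"  # first BOM: claims UTF-16 little-endian
    "feff"  # second BOM: the text that follows is really big-endian
    "004a006100700061006e006500730065"  # "Japanese" in UTF-16-BE
)

# What a reader that trusts the first BOM ends up with.
as_le = raw[2:].decode("utf-16-le")
# What was intended: skip both BOMs and decode as big-endian.
as_be = raw[4:].decode("utf-16-be")


def repair_text(text_bytes: bytes) -> str:
    """Decode BOM-prefixed UTF-16 text, tolerating the doubled LE+BE BOM."""
    if text_bytes.startswith(codecs.BOM_UTF16_LE + codecs.BOM_UTF16_BE):
        return text_bytes[4:].decode("utf-16-be").rstrip("\x00")
    # Normal case: let the codec pick the byte order from the single BOM.
    return text_bytes.decode("utf-16").rstrip("\x00")


if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in as_le[:2]])
    print(as_be)
    print(repair_text(raw))
```

Here `as_le` begins with U+FFFE followed by U+4A00 (䨀), while `as_be` and `repair_text(raw)` both give "Japanese".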

I haven't looked at the other MP3 files, but I imagine the cause is the same there too.

Re: [FIXED] Japanese Fairy Tales: Typo in section 10 title

Posted: January 1st, 2021, 2:33 am
by Peter Why
I'm glad that there's a sensible explanation for the Chinese ID3 display on archive.org. It's fairly common. The opening screen for my recording of Alice has it, too, and I've seen others: https://archive.org/details/alice_wonderland_0711_librivox

Peter

Re: [FIXED] Japanese Fairy Tales: Typo in section 10 title

Posted: January 1st, 2021, 4:03 am
by annise
It is only on some projects catalogued before our software update. We add the ID3 tags in the validator, so the different character sets used do not get to Archive.

Anne