Great machine reading, or popularising Mickiewicz

What would happen if artificial intelligence were trained on propaganda press texts? I shudder to think. That is why humans will always have to control AI, including those who use algorithms to digitize the collection of the National Library of Poland.

The hall is full to bursting. In a moment, a FutureTalks debate on artificial intelligence in the world of books is going to begin here. Artificial intelligence is a popular subject right now, so even here, at the October Book Fair in Cracow, it had to be included on the agenda. But for the bibliophiles who visit the fair each year this is a matter of concern. (“What the hell does artificial intelligence have to do with us? We, who so love the good old paper books with their smell and rustling pages that transport us to another dimension? We must find out what it is all about right now!”) One can easily see that in this company even e-book readers are barely tolerated. (“It’s not the same thing as a paper book… You can’t beat paper… Well, maybe on vacation because it takes much less space in your suitcase.”)

Artificial intelligence is the future, and the future of books – as some prophesy – is in question right now. Images are replacing the written word. It’s true that Poles have never been the most avid readers, but these days they allegedly read even less. Most have switched from TV to Netflix-like platforms.

Literature – fresh, but sacred

“Will we still be reading in one hundred years? Will books survive?” Łukasz Wilczyński, the debate’s moderator, began somewhat dramatically.

“Well, books are just data,” Łukasz Kozak*, the digital collection expert from the National Library in Warsaw, replied with a shrug. “Whether these data are written down or recorded in memory, whether we are reading from a papyrus, a paper book, a stone tablet, or an e-book reader is unimportant. It is still the same data, data which we interpret.”

The audience turned white when they heard this and their hearts skipped a beat.

“From the perspective of human history, paper books are still a very new invention,” Kozak continued. “Literature is a very fresh thing, and written literature is actually still very new! We are actually speaking here about a certain type of intermediality. If we consider books in this way, then the prospect of them dying is just an illusion.”

“Then what will libraries of the future look like?” the moderator asked with worry.

“I think you’re asking what they already look like,” Kozak countered. “Today, every self-respecting library digitizes its collection. Every librarian already knows what OCR, or Optical Character Recognition, is. It is a process through which we can analyze the visual form of writing. However you look at it, text is also an image. It is a change that has already taken place, but not everyone knows about it.”

Artificial intelligence vs. “Janko the Musician”

“Does artificial intelligence really read books?” the moderator probed.

“It does. And it is a boon to us,” Łukasz Kozak assured him. “It helps us to find the things we are looking for, just like Spotify helps us to find the one specific song or band in the sea of music. When the Polona digital library contained 40, 100, or even as many as 200,000 digitized books, I could roughly tell what was in there. I was able to wrap my mind around it all somehow. But today, when there are 3 million books there, there is no chance I could still do it.”

Łukasz Kozak is certain that artificial intelligence will be there for our pleasure just like the internet is.

“The literary canon that we have today is outdated. Children still have to read ‘The Knights of the Cross’ and ‘Janko the Musician’. We all had books from the obligatory reading list at school that we didn’t like. That’s why we resorted to book summaries. Knowing what I do about some of the masterpieces of positivism, I am not one bit surprised. I am convinced it would be better to just read the summary. AI will help us with that,” he remarked.

“Alright, so artificial intelligence does read books. But does it understand what it is reading?” the moderator expressed his doubts.

“It does, but not perfectly just yet. So-called sentiment analysis enables it to identify, for example, whether we are dealing with hate speech or a description of a date. It’s true that it can still be fooled quite easily, but it is learning very quickly.”

10 million documents

At the National Library in Warsaw, artificial intelligence is taught by the team of Sonia Wronkowska** from the IT Systems Unit at the Department of Information Technology. Łukasz Kozak works with this team.

If books are data, then the National Library is a giant database. 10 million documents – there is not enough time in one person’s life to read through even a tiny fraction of what can be found here. The library collects not just books, but also magazines, manuscripts, music prints, maps, postcards, photographs, iconography, as well as electronic, audio, and audiovisual documents. Everything that has been published in Polish.

“We are responsible for the development and maintenance of various IT systems that are used by the National Library and its readers: the Polona digital library, the digital repository, the integrated search engine,” says Sonia Wronkowska.

Digital national treasury

Large-scale digitization at the National Library began in 2006.

“At the time, our main task was to secure the collection,” she says. “We had to have a digital copy of the most precious and the most sensitive items in order to limit their physical use. This included, among other things, the so-called treasury of the National Library: the most valuable manuscripts, relics of key importance to Polish culture and the Polish language. In 2013 we decided that we would continue digitizing not only to secure the most valuable items, but also to make all books available to readers. After all, this is the purpose of the National Library. We are the only resource that is so rich in Polish literary works. We receive two copies of everything that is published in Poland. We have also collected publications released in Polish abroad since 1928, the year the National Library was established.”

Mass digitization started. Polona – the digital National Library – was created.

“In 2013, we had several tens of thousands of items digitized and made available online. Today, there are more than 3 million of them, which makes us one of the largest digital libraries in the world,” Wronkowska says. “The same thing is done by the most important national libraries across Europe. Right now, the largest digital library is the French Gallica, which started this process several years before us.”

“Since our library contains approximately 10 million documents, we still have a lot of work to do. But the democratization of information that we dreamed of has become a fact,” Łukasz Kozak says with a smile. “Everyone with access to the internet can open Polona and read the ‘Saint Florian Psalter’ or Frederic Chopin’s manuscripts.”

OCR, or how it’s done in Warsaw

During the most active periods up to two thousand documents are digitized daily at the National Library. How is it done?

“It begins with the traditional work of an analogue librarian who describes the document according to the rules of the librarian’s craft, or – as we say in our jargon – provides item metadata,” Sonia Wronkowska explains. “Next, the document is digitized. Of course, the document is also subject to a conservator’s inspection; the conservator also decides which digitization device should be used for the given document. The most valuable items are photographed.

“Next, the files are uploaded to the library’s core, or the digital repository. There the metadata produced by the librarian are associated with the scans and a digital object is created. This object has to be processed, so pagination is introduced and it is subjected to the OCR, or optical character recognition, process. A scan is an image file, a picture, that represents text. An OCR system translates it into a computer-readable format. This way the computer can recognize the text on the page and understand that it is different from numbering, a table, or a margin. It also enables the users to search through the document contents.

“After all, not everything that people are looking for can be found in the title or the descriptive metadata provided by the librarian. For example, a user may wish to find all mentions of his hometown in the library. So he would enter its name into Polona and expect to see in the results all books in which this town is mentioned. Such additional data are provided by OCR.”
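
The kind of full-text lookup described here can be illustrated with a small inverted index. This is only a sketch under stated assumptions, not a description of how Polona works: the document IDs, the sample texts, and the function names are all made up for illustration, and the matching is exact, so inflected Polish forms (for example "Sandomierza") would not be found without further linguistic processing.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def build_index(ocr_pages):
    """Map every token to the set of document IDs whose OCR text contains it."""
    index = defaultdict(set)
    for doc_id, text in ocr_pages.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Start from the first token's documents and intersect with the rest.
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical OCR output keyed by made-up document IDs.
ocr_pages = {
    "doc-001": "Kronika miasta Sandomierz i okolic",
    "doc-002": "Przewodnik po Krakowie",
    "doc-003": "Podróż z Warszawy do miasta Sandomierz",
}

index = build_index(ocr_pages)
hits = search(index, "Sandomierz")
print(sorted(hits))
```

The point of the sketch is that the town's name appears in neither title-level metadata nor a catalogue description; it is found only because the recognized page text itself was indexed.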

Old prints are unavailable for machines

After all that processing the object ends up in the repository. It is subject to long-term archiving in tape libraries in highly secure conditions. It is uploaded to Polona, where anybody can access it from any place and at any time. The repository is also used by other institutions that do not have their own system or wish to expand their reach, such as the Ethnographic Museum, the Frederic Chopin Institute, or the Jagiellonian Library.

However, a machine can only “read” those books that are printed with standard type, which means those from the 19th and 20th centuries. Only those volumes can undergo the standard OCR process. Earlier works – old prints and manuscripts – remain in a grey area. There are no ready-made tools for them. Therefore, at this moment, old prints remain inaccessible not only to machines, but also to the blind and to those who do not speak Polish. Meanwhile, the National Library contains such rarities as the oldest surviving piece of Polish prose – “Kazania świętokrzyskie” (The Holy Cross Sermons) – from the 13th century, a manuscript of the chronicle of Gallus Anonymus, the “Saint Florian Psalter”, “Rozmyślanie przemyskie” (Meditations on the life of Lord Jesus), “Rocznik dawny” (The Old Annals) containing a note on the baptism of Duke Mieszko, numerous manuscripts by C.K. Norwid and K.K. Baczyński, and the only surviving manuscript by Jan Kochanowski.

Will artificial intelligence be capable of reading them in the future? That’s what Jacek Tlaga*** is working on. He is an enthusiast of linguistic engineering with a passion for phonetics and phonology.

Jacek Tlaga has an idea

“We wanted to follow what was happening around the world, and so we started working on artificial intelligence,” says Sonia Wronkowska. “We had already worked with Jacek, who is an old print expert, on projects involving our oldest works. For example, he reconstructed for us the original pronunciation of ‘Bogurodzica’. We were familiar with his interests and capabilities. We managed to hire him for our project.”

“An idea struck me how the oldest Polish prints could be read and converted into a format that would be readable by computer software – namely text files,” says Jacek Tlaga. “I’ve been doing this for the past year and a half. It is a pilot project, an experiment, so I actually don’t know yet how it’s going to end. We have no deadline. Rather, we’re just testing the waters, trying to see what can be done. Other libraries around the world are doing the same thing right now: establishing small research teams that study how things stand. It wasn’t possible earlier, but deep neural networks have emerged and suddenly it turned out that the technology needed is within arm’s reach. I am trying to teach a machine how to read historical texts that do not have normalized type. Unfortunately, the Schwabacher type that Polish texts were printed in is difficult to read. Furthermore, old prints frequently contain misprints. Sixteenth-century printers often made mistakes.”

The machine reads, but it doesn’t understand

“How is it done in practice?” I ask.

“I use different tools. Can we actually say that a machine understands the text? Of course it doesn’t. It only understands the text as a string of consecutive characters. It is reading without understanding. An attempt to understand the text takes place afterwards. That is why, at first, the algorithm only recognizes what is a text and what is an image; what is a letter and what is only an ornament,” Tlaga explains to me. “For this purpose I use convolutional networks that recognize images. Then, I use recurrent networks that analyze sequences, or text. The network recognizes proper names, first names, geographic names, names of institutions.”

“Alright,” I say to myself, “but if we have, let’s say, a 16th century edition of Jan Kochanowski’s threnodies in which the initial of each poem looks like a tiny picture, will the software be able to understand what it is?”

“Decorative initials are a big challenge, yes,” Mr Tlaga admits. “Working with artificial intelligence largely consists in experimenting, checking to see what’s going to happen and what will work.”

Line after line, tens of thousands

Tlaga does not explain to the algorithm what is what. He only provides it with an image, gives it a text, and the algorithm learns by itself to assign the labels properly. A lot of data are required for this, but at the National Library data availability is not a problem. One just needs to prepare these data properly so that they turn into training data. Each image must contain only a single line of text with specific characters one after the other. So first the letters need to be transcribed by hand, specifying whether the line belongs to the main body of text, a title, an initial, or a signature. It is arduous work, because tens of thousands of such lines of text are needed.
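
A training example of the kind described above (one line image, its hand-made transcription, and a label for the type of line) might be represented roughly as follows. This is a minimal sketch and not the National Library's actual data format: the class name, the field names, the file paths, the sample transcriptions, and the choice to reserve ID 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The four line types named in the article.
LINE_TYPES = {"body", "title", "initial", "signature"}


@dataclass(frozen=True)
class LineSample:
    """One training example: an image of a single line plus its transcription."""
    image_path: str      # hypothetical path to the cropped line image
    transcription: str   # characters in reading order, typed in by hand
    line_type: str       # one of LINE_TYPES

    def __post_init__(self):
        if self.line_type not in LINE_TYPES:
            raise ValueError(f"unknown line type: {self.line_type!r}")
        if "\n" in self.transcription:
            raise ValueError("a sample must contain exactly one line of text")


def build_alphabet(samples):
    """Assign an integer ID to every distinct character seen in the samples.

    IDs start at 1; keeping 0 free (e.g. for padding) is an assumption here.
    """
    chars = sorted({ch for s in samples for ch in s.transcription})
    return {ch: i for i, ch in enumerate(chars, start=1)}


def encode(sample, alphabet):
    """Turn a transcription into the list of character IDs a model would see."""
    return [alphabet[ch] for ch in sample.transcription]


# Made-up samples; the paths and texts are placeholders, not library data.
samples = [
    LineSample("lines/0001.png", "Tren I", "title"),
    LineSample("lines/0002.png", "Wszytki płacze", "body"),
]

alphabet = build_alphabet(samples)
encoded = encode(samples[0], alphabet)
print(len(alphabet), encoded)
```

Checking each sample at creation time is a design choice worth noting: with tens of thousands of hand-entered lines, catching a mistyped label or a stray line break early is far cheaper than discovering it after training.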

“The more of them we get, the stronger and the more effective the algorithm will be,” states Tlaga. “I am currently working on a platform that will allow us to input these data easily and quickly; right now I am at the prototype development stage.”

At first the algorithm was making many mistakes that had to be corrected manually. But, over time, the number of mistakes has decreased. Now, after one year of collecting data and training, the algorithm is doing quite well at reading the Polish Schwabacher type, and it has become very effective at reading Jan Kochanowski’s epigrams. It can also recognize which printing shop produced the book.

Librarians will never go extinct

“Do you know what libraries of the future will look like? What kind of place will the National Library be in 20 years’ time? Will there still be a place for librarians in it?” I ask.

“I wouldn’t worry about that,” Sonia Wronkowska tries to reassure me. “The National Library is a huge institution. 700 people work here, and we still have a lot of work to do with the collection that was entrusted to us. Traditional collections are not accessible to blind people or to those who do not speak Polish. Meanwhile, anybody can paste digital text into translation software to get a general idea of what it’s about. Google’s indexer also cannot cope with images that have no description.

“Then, add to these the music collection written in musical notation. Why shouldn’t AI enable users in the future to play back music from a printed score?

“I believe that librarians will just have slightly different duties,” adds Wronkowska. “Artificial intelligence will help them to deal with the most arduous work. A machine cannot replace a human, but it can help them to save time by, for example, generating keywords that describe a book’s contents.

“We are really counting on the artificial intelligence tools to support us.”

“Are we going to make Mickiewicz more popular around the world?”

“Yes. We put a lot of stress on saving data in international standards so that different systems can make use of them. After all, machine translation software is becoming better and better.”

Culture, or the advantage of rubbish and stupidity

“We read less and less… Will digitization and online access to library collections improve this situation?” I ask.

“We do not read less at all. We read more, but differently. After all, each and every one of us is constantly reading something on the phone.”

“But those are rather short texts.”

“That’s true, but text messages aren’t that much shorter than Kochanowski’s epigrams. Making the National Library’s resources available in digital form will make it possible to read epigrams on the phone. Furthermore, we digitize resources shelf after shelf, no censorship, no selection. We digitize books that say the Earth is flat, curious old medical guides, and German propaganda from the time of occupation during World War II. Polish culture isn’t limited to Mickiewicz and Kochanowski. From today’s perspective, a large part of our collection consists of books that aren’t necessarily smart or good. This is the actual image of culture and cultural heritage.”

“Nobody is telling this to kids in school.”

“That’s true. In this case we are also arriving at the issue of data bias. What would happen if we trained a model on historical newspapers, on books that contain knowledge that is completely outdated and inconsistent with modern standards, and frequently harmful? How would it then classify various phenomena? One shudders to think, doesn’t one? That is why humans will always have to control artificial intelligence. Even if it becomes capable of carrying out nearly all tasks, someone will have to oversee it. Because it is as naive as a child. If someone speaks to it using sentences from the propaganda press, that is what it will learn.”

*Łukasz Kozak – medievalist and technology expert. He has been collaborating with the National Library of Poland on creating and developing digital services for 8 years.

**Sonia Wronkowska has worked at the National Library for more than 6 years now, and currently holds the position of the IT Systems Unit’s manager. A musicologist by education, a librarian by trade, she is studying information technologies at the Polish-Japanese Academy of Information Technology. Associated with RISM (Répertoire International des Sources Musicales) and several projects related to digital humanities, she is involved in MEI (The Music Encoding Initiative) and IIIF (International Image Interoperability Framework). In her free time she does some editing and researches early music.

***Jacek Tlaga has worked at the National Library since 2018, and he takes care of automated analysis of historical documents. He is engaged in the research and development of various artificial intelligence tools. He has many years of experience in data mining as well as processing and analysis of signals and images. He has a passion for reconstructing the way linguistic monuments sounded, and to this end he conducts linguistic research in which he also uses digital methods.
