This blog post evaluates the strengths and weaknesses of the approaches used by British Newspapers 1600 – 1950 and British History Online, mainly through the debate between OCR (optical character recognition) and double re-keying. British Newspapers 1600 – 1950 is a digital archive of over three million pages of historic newspapers, newsbooks, and ephemera from national and regional papers. The project brought together two collections: the 17th and 18th century Burney collection, and British Newspapers 1800 – 1900. It also contains newspapers from many British colonies. British History Online is a digital archive that contains collections from museums, libraries, and archives that pertain to British history. Both projects brought together different collections and digitised most of the sources to be viewed online; however, they went about digitising them in different ways.
British Newspapers, as mentioned above, has digitised over three million pages of British newspapers, and it accomplished this great task by using optical character recognition. OCR involves scanning a source, which software then turns into text that is output for the reader. This is obviously a very simplified explanation of the process, and many factors affect it, such as the resolution of the scans (300 – 600 dpi is suggested) and the use of automatic paper feeders to speed things up. OCR can process documents with incredible speed, but that speed comes at the price of accuracy. The largest problem with OCR, for history anyway, is that it is optimised for laser-printer-quality text, while most of the texts we want digitised are handwritten. A project working with printed newspapers, like British Newspapers, would have suffered less from this than others do. Another problem with OCR output is the misreading of letters, especially the long s (ſ), which the software often reads as an f. The software does not know that some letterforms differ from their modern equivalents, and this makes keyword searching through the output text quite difficult. A thorough search using different spelling variations, or a wildcard search, can overcome this, but it is still a flaw in the procedure.
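To make the variation search mentioned above a little more concrete, here is a minimal sketch in Python. It assumes that the only OCR error we care about is a long s being read as an f, and that a long s never appears at the end of a word. The function names `long_s_variants` and `find_matches` are made up for this illustration and are not part of either project's search tools.

```python
from itertools import product


def long_s_variants(word):
    """Return every spelling OCR might have produced for `word`,
    assuming each non-final 's' could have been read as 'f'."""
    last = len(word) - 1
    options = []
    for i, ch in enumerate(word):
        if ch == "s" and i != last:
            # A long s was not used at the end of a word (assumption),
            # so only non-final positions get the 'f' alternative.
            options.append(("s", "f"))
        else:
            options.append((ch,))
    return sorted("".join(combo) for combo in product(*options))


def find_matches(text, keyword):
    """Return the words in `text` that match any variant of `keyword`."""
    variants = set(long_s_variants(keyword.lower()))
    words = (w.strip(".,;:!?") for w in text.lower().split())
    return [w for w in words if w in variants]


if __name__ == "__main__":
    sample = "The Houfe of Commons met in the house."
    print(long_s_variants("house"))
    print(find_matches(sample, "house"))
```

The number of variants doubles with every non-final s, so this only stays practical for short keywords. A real archive search would more likely rely on wildcards or a fuzzy index.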
British History Online did not use OCR to process its documents; instead it used a process called double re-keying. In double re-keying, two different typists each manually type out the text of a document. The two transcripts are then compared and the differences are resolved manually. This process is obviously much more accurate than OCR, as the two typists are very unlikely to make the same mistakes, and any mistakes they do make are picked up during the checking phase. The process boasts an accuracy rating of 99% when compared with the original text. However, this accuracy also comes at a price: double re-keying is very expensive, so it is not a feasible option for many projects. It is also a very slow process, so if you have millions of pages to process, as British Newspapers did, double re-keying might not be the digitising process for you.
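The comparison step can be pictured with a short Python sketch. This is not how British History Online actually checks its transcripts; it simply assumes each typist's work is available as plain text and uses the standard library's `difflib` to list the places where the two versions disagree, which a human checker would then resolve against the original. The function name `keying_differences` is hypothetical.

```python
from difflib import SequenceMatcher


def keying_differences(first, second):
    """Compare two typists' transcripts word by word and return a list of
    (first_typist_words, second_typist_words) pairs where they disagree."""
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Each disagreement is flagged for a human to check
            # against the original document.
            differences.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return differences


if __name__ == "__main__":
    typist_one = "The parish of St Mary in the year 1642"
    typist_two = "The parish of St Marv in the year 1642"
    for left, right in keying_differences(typist_one, typist_two):
        print(f"Typist one: {left!r}  Typist two: {right!r}")
```

If both typists make the same mistake in the same place, nothing is flagged, which is why the method depends on their errors being independent.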
In conclusion, both processes have their strengths and weaknesses. OCR is fast and (relatively) cheap but suffers from accuracy problems, while double re-keying is very accurate but also expensive and slow. It is also important to note that these are not the only two digitising processes; there are many more in between these two extremes. Which process is best for you will depend on your project: how many documents you have to digitise, your budget, and your time constraints.