OCR Vs. Double Re-keying

This blog post will evaluate the strengths and weaknesses of the approaches used by British Newspapers 1600 – 1950 and British History Online, mainly through the debate between OCR (optical character recognition) and double re-keying. British Newspapers 1600 – 1950 is a digital archive of over three million pages of historic newspapers, newsbooks, and ephemera from national and regional papers. The project brought together two collections, the 17th and 18th century Burney Collection and British Newspapers 1800 – 1900, and it also contains newspapers from many British colonies.[1] British History Online is a digital archive that contains collections from museums, libraries, and archives that pertain to British history.[2] Both projects brought together different collections and digitised most of the sources to be viewed online; however, they went about digitising them in different ways.

British Newspapers, as mentioned above, has digitised over three million pages of British newspapers, and it accomplished this great task by using optical character recognition. OCR involves scanning a source, which software then converts into text that is output for the consumer. This is obviously a very simplified explanation of the process, and many factors affect it, such as the resolution of the scans (300 – 600 dpi is suggested) and the use of automatic paper feeders to speed up the process.[3] While OCR can process documents with incredible speed, this comes at the price of accuracy. The largest problem with OCR, for history at least, is that it is optimised for laser-printer-quality text; this is a problem because most of the texts we want digitised are handwritten, although the British Newspapers project would have suffered less from this than others do.[4] Another problem with OCR output is the misreading of letters, especially the long s. This makes keyword searching through the output text quite difficult, because the software does not know that some letters were not printed the same way as they are now. Although a thorough search using different spelling variations, or a wildcard search, can overcome this, it is still a flaw in the procedure.
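To give a sense of what such a variant search might look like in practice, here is a minimal Python sketch that treats every 's' in a search term as potentially misread as 'f' in the OCR output. The sample text and the function name are my own invention, not part of the British Newspapers search interface.

```python
import re

def long_s_pattern(term: str) -> re.Pattern:
    """Build a regex that matches a term even where OCR has read a long s as 'f'."""
    # Every 's' in the query may appear as either 's' or 'f' in the OCR text.
    return re.compile(
        "".join("[sf]" if ch in "sS" else re.escape(ch) for ch in term),
        re.IGNORECASE,
    )

ocr_text = "The Parliament paffed an act concerning the fale of corn."
for term in ["passed", "sale"]:
    print(term, "->", long_s_pattern(term).findall(ocr_text))
# passed -> ['paffed']
# sale -> ['fale']
```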

British History Online did not use OCR to process its documents; instead it used a process called double re-keying. Double re-keying digitises a document by having two different typists manually input its text. The two transcripts are then compared and the differences are resolved manually. This process is much more accurate than OCR, as the two typists are very unlikely to make the same mistakes, and any discrepancies are picked up during the checking phase. It boasts an accuracy rating of 99% when compared with the original text.[5] However, this accuracy comes at a price: double re-keying is very expensive, so it is not a feasible option for many projects. It is also a very slow process, so if you have millions of documents to process, as British Newspapers did, then double re-keying might not be the digitising process for you.
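The comparison step at the heart of double re-keying can be illustrated with a short Python sketch using the standard library's difflib; the two example keyings below are invented, and in a real workflow the flagged disagreements would be resolved against the original document.

```python
import difflib

# Two independent keyings of the same passage; the second typist has misread the surname.
keying_a = "The said John Smith was committed to Newgate on 3 May 1745.".split()
keying_b = "The said John Smyth was committed to Newgate on 3 May 1745.".split()

matcher = difflib.SequenceMatcher(None, keying_a, keying_b)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        # Flag the disagreement for a human editor to resolve against the original.
        print(f"{tag}: {keying_a[i1:i2]} vs {keying_b[j1:j2]}")
# replace: ['Smith'] vs ['Smyth']
```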

In conclusion, both processes have their strengths and weaknesses. OCR is fast and (relatively) cheap but suffers from accuracy problems, while double re-keying is very accurate but expensive and slow. It is also important to note that these are not the only two digitising processes; there are many more in between these two extremes. The nature of your project, how many documents you have to digitise, your budget, and your time constraints will dictate which process is best for you.

[1] ‘British Newspapers, 1600-1900’, Connected Histories, consulted on 27/04/2015.

[2] ‘About British History Online’, British History Online, consulted on 27/04/2015.

[3] Daniel J. Cohen & Roy Rosenzweig, ‘Becoming Digital: How to Make Text Digital’, Digital History, consulted on 27/04/2015.

[4] ‘Creating and Digitising Electronic Texts’, AHDS, consulted on 27/04/2015.

[5] ‘About This Project’, London Lives 1690-1800, consulted on 27/04/2015.

Word Clouds and History

This blog post will be reflecting on word cloud generators, their strengths and weaknesses, and their potential uses for history. Word clouds have risen in popularity over the last few years and can be seen all over the internet, used by many organisations. This is because word clouds offer a very cheap way to represent a quantitative element of a corpus in an easy-to-understand visual form. They work by breaking a corpus down into individual words and counting how many times each one appears; each word is then displayed at a size proportional to its frequency in the original corpus. This has the benefit of surfacing words, and potentially themes, throughout a very large corpus, something that would take years to do manually. For history this is an obvious benefit, as working with big data allows patterns to emerge that we would never find without computing. A good example is a word cloud produced by the Journal of American History on the 1858 Lincoln-Douglas debates.[1] From this word cloud we can see what the most hotly contested topics were during the debates in a much faster and easier way than reading through the entire transcript.
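The counting and sizing step described above can be sketched in a few lines of Python; the filename and the short stop-word list here are placeholders of my own, and real generators ship with much longer stop-word lists.

```python
from collections import Counter
import re

# Read a plain-text corpus (placeholder filename) and split it into words.
text = open("debate_transcript.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z']+", text)

# Drop common 'stop words' that would otherwise dominate the cloud.
stop_words = {"the", "and", "of", "to", "a", "in", "that", "is", "it", "for"}
counts = Counter(w for w in words if w not in stop_words)

# Display size is scaled by frequency, which is all a word cloud really encodes.
max_count = max(counts.values())
for word, count in counts.most_common(20):
    size = 10 + 60 * count / max_count
    print(f"{word}: {count} occurrences, drawn at {size:.0f}pt")
```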

The benefits of word clouds are quite numerous, not only for history but for academia as a whole. One of the biggest benefits for history is the amount of time they can save compared to doing the same task manually. Although there are limits to the conclusions you can reach via word clouds, they fulfil a certain role that can only add to historical research. Another benefit is that they are very visual and very easy to interpret, even if the reader knows nothing about the original corpus. This can be very helpful when presenting historical data to people who do not have an academic background.

Even though this blog has so far been showing the benefits of word clouds, there are many downsides to using them. One of the largest weaknesses of using word clouds in history is that you lose context in the visualisation.[2] Take, for example, a word cloud of captains’ journals from ships under the East India Company: the word ‘crewman’ might appear a lot, but we have no idea why, whether the captain is complaining or praising. A related problem is raised by Adam Crymble, who states that we think in metaphors and ideas, not in words; this is a problem when the word cloud reduces those ideas to individual words that mean something different from what they meant in the original corpus.[3] Another flaw of word clouds, in terms of historical research, is that frequency does not always equate to importance. Although most word cloud generators include a function to remove common words such as ‘and’ and ‘the’, this does not solve the problem that some words will top the frequency list without being helpful for understanding the original corpus.

In conclusion, whether word clouds are helpful or not really depends on what purpose you want them to fulfil. If you expect too much from word clouds you will probably find them wanting; in their current form they are best used for simple tasks, though an interactive version that showed the data behind the words would be a much more viable tool for historical research. For now there are better visualisation tools suited to more complicated tasks.

[1] ‘1858 Lincoln-Douglas Debates’, Journal of American History, consulted on 26/04/2015.

[2] S. Graham, I. Milligan, & S. Weingart, ‘Basic Text Mining: Word Clouds, Their Limitations, and Moving Beyond Them’, The Historian’s Macroscope: Big Digital History, consulted on 26/04/2015.

[3] Adam Crymble, ‘Can we Reconstruct a Text from a Word Cloud?’, Thoughts on Public & Digital History, consulted on 26/04/2015.

3D Modelling and History

This blog post will be critically analysing London 1666, a 3D model of London before the Great Fire of 1666, built in Crytek’s CryEngine as part of the ‘Off the Map’ competition run by the British Library, Crytek, and GameCity. This blog will also discuss the use of 3D modelling in history and what it has to offer historians.

3D modelling is, commercially at least, a recent technological advance, and it has had an impact on many aspects of life, but until recently history was not one of them. 3D modelling does, however, have certain benefits that can improve aspects of history, and one of its most obvious uses is in museums. Museums have started to use 3D modelling to make copies of objects and artefacts and display those instead of the originals. The Smithsonian has recently started creating 3D models of many of its artefacts and has provided an online viewer so they can be seen by millions on the internet.[1] Many institutions have also been scanning their artefacts and making the files available online so that anybody with a 3D printer can download and print them.[2] Virtual 3D modelling has also recently taken off as a method of showing historic places; it has mainly been used by the heritage sector thus far but shows promise elsewhere as well.[3] Video games have been making use of 3D environments for decades, most famously in city-building simulations such as SimCity. While this has not been applied to digital history much in the past, it has been used in other sectors, and is now being explored by digital historians.

The London 1666 project was designed by six Game Art Design undergraduates at De Montfort University, who recreated London just before the Great Fire using the CryEngine. The CryEngine is most famous for its use in PC video games such as Far Cry, and is billed as one of the most graphically advanced engines available. The aim of the ‘Off the Map’ competition was to recreate maps in the most imaginative ways possible using CryEngine 3; the maps were selected from the British Library’s cartographic collection and could depict anything from London to fantasy worlds. The London 1666 team used maps of 17th century London, paintings, and surviving buildings to recreate the streets and houses. While the project eventually won the competition and has been praised by many people, there are some questions that have to be asked. One of the big problems is the question of authenticity. From their video it is clear that the team used a little artistic licence when it came to populating the streets with props, which were added to make the different streets distinguishable from each other.[4] Many of the props were based on documented items, but that does not mean those props were actually in Pudding Lane; for instance, the team used tavern signs to add a little more flavour to the streets, but drew on signs documented from anywhere in London. All that being said, it was still a great feat and they deserve the praise they have received. They painstakingly recreated the wattle-and-daub buildings, many of which were very different shapes and sizes. They also went through a great deal of material at the British Library and researched the topic very well, and this shows in the final product. Finally, they should be praised for their use and handling of the CryEngine; it is no easy task to translate research into a final product of this kind, but they did it admirably.

[1] New York Post, ‘Smithsonian Creating 3D Models of Historical Artifacts’, http://nypost.com/2013/11/13/smithsonian-creating-3d-models-of-historical-artifacts/; consulted on 22/04/2015.

[2] Mathew Williamson, ‘Open-source 3D Scans of Museum Items Generate Amazing New Creative Works’, BoingBoing, http://boingboing.net/2015/01/02/open-source-3d-scans-of-museum.html; consulted on 22/04/2015.

[3] Cara Ellison, ‘CryEngine & The British Library’, Rock, Paper, Shotgun, http://www.rockpapershotgun.com/2013/12/20/cryengine-the-british-library-2013s-unusual-team-up/; consulted on 22/04/2015.

[4] Pudding Lane Productions, ‘Populating’, http://puddinglanedmuga.blogspot.co.uk/; consulted on 22/04/2015.

A Critical Evaluation of ‘History Learning Site’

This blog will be critically analysing the website ‘History Learning Site’ and assessing its strengths and weaknesses with regard to its use as a ‘trustworthy’ resource.[1] As mentioned in my previous blog, ‘How Can Programming be Useful for Historians?’, historians are very thorough when it comes to the integrity of their sources, and nowadays the internet is brimming with sources for the historian to use. However, many historians are still very apprehensive about using websites in their research, as they usually do not fit into the rigid framework that sources from books or journals do. Websites often do not tell the reader where their information came from, which automatically excludes them from many historians’ field of view. This is best exemplified by the famous internet hoax of ‘The Last Pirate’, in which undergraduate Jane Browning of George Mason University, as part of her Digital History course, made up the story of Edward Owens, a fisherman turned pirate in the late 19th century.[2] Browning edited Wikipedia pages, fabricated interviews with professors who did not exist, and kept a blog where she updated everybody on her research. This fooled many people and was only pulled down when the Reddit community proved Browning’s work to be false.

The first thing you notice when you arrive at the History Learning Site is the sheer number of advertisements on the front page. Counting them, there are a total of six large and very obnoxious advertisements, five of which are not history related, and one of which recommends Adolf Eichmann to me. This automatically sets alarm bells ringing for a variety of reasons. One is that the site is obviously meant to earn money via these ads, which suggests it has no other financial backing such as grants. The ads are also very distracting and make the website feel very unprofessional and non-academic. The topics that the website covers are very disparate, and there is no obvious thread connecting them. They vary wildly from ancient Rome through to the Civil Rights Movement in America, and while some supposedly cover very broad subjects, such as ‘Britain 1600-1900’, others are weirdly specific, such as ‘The Counter Reformation’. Although the website covers quite a few topics, it is obvious that its main demographic is supposed to be GCSE students, as many of the topics cover GCSE modules such as Medicine Through Time and The Rise of Nazism.

Also relevant to the website’s ‘trustworthiness’ is who made it and why. This is important because the person or organisation that created a website has a motive for doing so, and that can tell us a lot about the website.[3] As the About the Author page tells us, the original author of the website was one Chris Truman, who had a BA (Hons) in History from Aberystwyth University and an MA in Management from Brighton University. He also apparently taught history at both a basic and an advanced level, although it does not clarify whether this was at a school, a college, or whether he just dragged random people off the street and forced them to listen to him. While a BA in History is a suitable qualification for the project he started, it is obvious that either he did not listen much when his lecturers were explaining the importance of referencing, or he had a very specific project in mind, one that did not require references. I have to believe it is the latter, which is backed up by the fact that the website is obviously aimed at GCSE-level students, who do not require footnotes for any of their work.

[1] History Learning Site, http://www.historylearningsite.co.uk/; consulted on 21/04/2015.

[2] Lynn University, ‘Teaching by Lying’, http://lynn-library.libguides.com/c.php?g=203952&p=1345781; consulted on 21/04/2015.

[3] UC Berkeley Library, ‘Evaluating Web Pages’, http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html; consulted on 21/04/2015.

How can Programming be Useful for Historians?

This blog post will be offering up an argument as to whether programming can be useful for historians and their research. Programming is not a discipline you would normally associate with a subject such as history; indeed, it is only in the last decade or so that programming has started to become intertwined with historical research. Many scholars are still reluctant to adopt the technological advances that might help them with their research, and among those advances programming is one of the more difficult to learn, but perhaps one of the most rewarding.

One of the biggest benefits of learning a programming language is that you become aware of the kind of code and algorithms that go into a program, which is particularly useful where search engines are concerned. Nowadays most historians do at least some research via Google Scholar, and students most likely do most of theirs through it, so understanding how Google fetches its results is of great importance. Historians are very thorough when it comes to the integrity of their sources, so it seems logical that they should apply the same rigorous approach to an algorithm that decides which sources are the most visible, or whether they appear at all. Also, if historians know how a searchable database works, then they can optimise their searches to find the most relevant data quickly.[1] Another benefit of learning to program is the ability to create your own programs, although add-ons for existing programs are usually more than enough. This allows the historian to work with what has been dubbed ‘big data’. Big data refers to massive amounts of data that would be near impossible for a historian to process manually.[2] Programs can be written that do whatever you want with the data automatically, cutting the time down from an entire career to a few hours. However, there probably isn’t a program out there that does exactly what you want, and this is where programming comes in: it allows you to create the program or add-on that will enable you to manipulate the data in the way you want.
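As a rough illustration of the kind of small, purpose-built script the previous paragraph has in mind, the Python sketch below searches a folder of plain-text transcripts for a keyword and prints each hit with a little surrounding context; the folder name and search term are placeholders of my own.

```python
from pathlib import Path

def keyword_in_context(folder: str, keyword: str, window: int = 40) -> None:
    """Print every occurrence of a keyword across a folder of .txt transcripts,
    with a little surrounding context for each hit."""
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        lower = text.lower()
        start = lower.find(keyword.lower())
        while start != -1:
            snippet = text[max(0, start - window): start + len(keyword) + window]
            print(f"{path.name}: ...{snippet}...")
            start = lower.find(keyword.lower(), start + 1)

# Placeholder folder and search term.
keyword_in_context("transcripts", "enclosure")
```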

There are, however, many drawbacks to programming’s use in history, and reasons why only digital historians have so far taken up the mantle of programmer. The biggest reason we do not see programming used more widely is that it is difficult to learn and can be very time consuming. Also of concern is which language is worth spending the time and effort to learn; there are a lot of programming languages, such as C, Java, PHP, JavaScript, Python, and C++. Although many of these languages overlap, and certain languages are better at accomplishing certain tasks, you will be limited in what you can accomplish if you only learn one language. Connected to this is the fact that learning a programming language is much like learning an actual new language, and one must decide whether learning to code is worth more than learning a new spoken language.[3] Although this may seem a weak argument, it is important to put into perspective the time and effort needed to become competent with a single programming language, especially when you have many other demands on your time such as research, teaching, and marking.

So, is programming worth the time and effort to learn? That is up to the individual and their needs, but I believe we should all at least try to appreciate the benefits programming can bring to historical research.

[1] Doing History in Public, ‘Why Historians Should Learn to Code (at least a bit)’, http://doinghistoryinpublic.org/2014/07/03/why-historians-should-learn-how-to-code/; consulted on 20/04/2015.

[2] The Historian’s Macroscope, ‘The Historian’s Toolkit’, http://www.themacroscope.org/?page_id=330; consulted on 20/04/2015.

[3] THATCamp CAA, ‘Coding as a Foreign Language’, http://caa2013.thatcamp.org/02/12/coding-as-a-foreign-language/; consulted on 20/04/2015.

Transcription and Citizen Historians

This blog post will be concerned with transcription, especially the Transcribe Bentham and Old Weather projects. One of the biggest benefits for researchers in setting up a transcription project is that it allows huge amounts of data to be processed, whether it needs tagging, transcribing, or annotating, both quickly and cheaply.[1] While the benefits for the researchers are obvious, the benefits for the volunteers are more subtle. One is the educational interest the sources might hold for the volunteer, who might not otherwise have had access to them.[2] Other benefits include the skills volunteers gain, such as interpreting historical documents, which benefit both the general public and the academic community.

Both Transcribe Bentham and Old Weather did many things well in terms of making the experience of transcribing as simple and enjoyable as possible. Old Weather was by far the more novice-friendly of the two: its magnifying tool was both quick and adaptable, it made it very clear what you had already transcribed, and entries were easily editable in case of mistakes. Another extremely useful aspect of Old Weather was its forum, which contained many user-made tutorials that were indispensable when trying to figure out both the weather and cloud codes. While Transcribe Bentham was not as user-friendly as Old Weather, it does have its benefits, and it is more famous thanks to its advertising, which helped build a core of dedicated transcribers.[3] The transcription tool used for Transcribe Bentham is significantly more basic than Old Weather’s, but it is easy to understand. Both had adequate tutorials, although, again, Old Weather had the better of the two, as it was more interactive and had you learning while doing.

The term ‘citizen historian’ has been gaining a lot of ground recently, and is used to describe someone who helps out with historical research, usually through transcribing primary sources. Tasks can also include georectifying historical maps, identifying people and places in photographs, oral history, and much more, but transcribing gets the most attention. One way in which both of these transcription projects could promote a better citizen historian experience is by introducing new users to the forum, and thus to other transcribers, as this is a good way to turn a “crowd” into a “community”, and lightweight contributors into heavyweight ones.[4] Another would be to build a system of competition or achievements within the projects; Old Weather does this already, but it could be built upon, as the older users are so entrenched that there is little room for advancement. This competition would serve as an impetus for more transcriptions per user.


[1] Harry Klinhamer, ‘Where are the citizen historians?’, Public History Commons, http://publichistorycommons.org/where-are-the-citizen-historians/; consulted on 15/03/2015.

[2] Jonathan Silvertown, ‘A new dawn for citizen science’, Trends in Ecology and Evolution, Vol. 24 No. 9 (2009).

[3] Tim Causer and Valerie Wallace, ‘Building A Volunteer Community: Results and Findings from Transcribe Bentham’, Digital Humanities Quarterly, Vol. 6 No. 2 (2012).

[4] Ibid.

Historians and Twitter

This blog post will be concerned with the use of Twitter by academics, and its pros and cons. Since its advent, Twitter has seen a meteoric rise in popularity, and people from all walks of life have taken advantage of the website. Academics are no exception; they now litter the website and have come to terms with many of Twitter’s benefits and weaknesses. Nowadays there are many online guides for academics on the proper use of Twitter, one example being the LSE’s ‘Twitter guide for academics’.[1] This comprehensive guide shows you how to set up a Twitter account, useful Twitter terminology, different tweeting styles, how to use Twitter to help promote your research, and more.

One of the main benefits of Twitter is its sheer popularity, with over 500 million tweets sent a day.[2] With such a huge platform to shout from, historians can be sure that their thoughts (as long as they are 140 characters or fewer) are heard by many people. Twitter also reaches people beyond a historian’s regular readership, because it lacks the restrictive elements of traditional academic idea sharing through journals, books, and conferences.

Another benefit, especially for academics, is that posting to Twitter is very fast. This is beneficial because it requires only the bare minimum of thought and planning and does not need to distract a person from their current work. It can also be an outlet for stray ideas and a way to gauge the reaction of your followers.

Lastly, Twitter can be used to promote your own work and, more importantly, other people’s work and projects. Jennifer Evans’s Twitter feed is constantly filled with retweets promoting other historians’ blogs.[3] This helps the online history community grow, as people who follow individual historians become aware of other accounts they might find interesting.


As mentioned previously, Twitter is restricted to 140 characters or fewer. This can become a problem if you are trying to convey a message or idea that just won’t squash down to the character limit. Worse, trying to fit your idea to that limit might lead to people misinterpreting it.

Probably my greatest worry about posting on Twitter (and here) is that people might actually read it. I realise that that is usually the point, but the internet can be a very unkind place. People can reply to your tweets anonymously, saying whatever they want, including many cruel things about your goldfish, or whatever offends you personally the most.

Linked to the above point, Twitter is dangerous because people (and that means you as well) can be incredibly stupid at times. We all have bad days and say stupid things, but do that on Twitter and not only do thousands of people get to watch (popcorn is optional), it is also usually preserved by some brave soul via a screenshot. This can lead, at best, to minor embarrassment, and at worst to losing your job and credibility.

So, there we go: my pros and cons of Twitter as a tool for historians. Like every tool, if you treat it with respect and know its benefits, then you’re probably doing it right. Well done.

In conclusion, Twitter can be a very useful tool for academics and offers new opportunities for them and their work. However, if misused it can also be dangerous to the user and their career.

[1] London School of Economics and Political Science, ‘Twitter guide for academics’, http://blogs.lse.ac.uk/impactofsocialsciences/files/2011/11/Published-Twitter_Guide_Sept_2011.pdf; consulted on 21/01/2015.

[2] Twitter, https://about.twitter.com/company; consulted on 21/01/2015.

[3] Twitter, https://twitter.com/HistorianJen; consulted on 21/01/2015.