In this blog post, I want to introduce you to the basics of creating your own ground-truth training set for OCR/HTR, whether you're building your own model or improving an existing one. Essentially, this is about how to create more data that can be used to train or retrain an algorithm so that it performs better on your historical texts. While this may seem obvious to many digital humanities practitioners, the task is often carried out by subject matter experts from the humanities who may be less familiar with the technical side.

Improving a baseline model by fine-tuning

When we talk about creating your own ground-truth dataset for handwritten text recognition (HTR) or optical character recognition (OCR), what you usually want to do is fine-tune, that is, improve, an existing model. Many models are already available (for example on HTR United or in Transkribus), so rather than starting from scratch, you can provide additional training pages that target the specific issues where the model consistently makes mistakes. In fact,

Read more: How to create training (ground truth) data to improve OCR/HTR
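To make the idea of ground-truth data more concrete, here is a minimal Python sketch of how such a dataset is often organized on disk: one cropped line image next to one plain-text transcription file (the `image.png` / `image.gt.txt` naming convention used by some HTR training tools). The function name and folder layout are my own assumptions for illustration, not something prescribed by the original post.

```python
from pathlib import Path

def collect_ground_truth(folder):
    """Pair each line image with its verified transcription.

    Assumes the common convention of storing a cropped line image
    (line_001.png) next to its transcription (line_001.gt.txt).
    Returns a list of (image_path, transcription_text) pairs,
    skipping transcriptions that have no matching image.
    """
    pairs = []
    for txt in sorted(Path(folder).glob("*.gt.txt")):
        img = txt.with_name(txt.name.replace(".gt.txt", ".png"))
        if img.exists():
            pairs.append((img, txt.read_text(encoding="utf-8").strip()))
    return pairs
```

A fine-tuning run would then feed these pairs (or the files directly) to the trainer of your chosen engine, with pages deliberately selected from the material where the baseline model makes consistent mistakes.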

See also: Original Source by The LaTeX Ninja

Note: Copyright remains with the blog author and the blog. For licensing details, please see the linked original source.