Text formats: a barrier to accessibility.

Humans are an incredibly diverse species, which makes it challenging to build a way of sharing information that everyone can use. As a result, people with unusual or very specific requirements tend to get ignored. People with dyslexia are one example.

Text, in reality, is just an encoding scheme that allows us to store and share information scalably. Scalability and data encoding schemes are areas in which Computer Science excels. It is therefore surprising that the solutions currently available, most of which use computers in some form, are so bad at helping dyslexics. The solutions on offer are often costly, badly implemented, unstable and ergonomically poor. This is definitely the case with the current offerings that try to help dyslexics read text.

Because text is stored in a multitude of different formats, it is hard to develop a single accessibility solution that helps dyslexics read it. The core problem in making text more accessible is generalizing the accessibility solution to all text formats. If this problem is solved effectively, almost all other accessibility problems that pertain to reading can be quickly fixed.

In my opinion, there are four different groups of text storage formats: PDF, miscellaneous rich text formats, plain text and paper. I believe the main goal should be to convert all text to plain text so that it is as accessible as possible. It is currently straightforward to convert most rich text formats to plain text using software such as Pandoc. Text on paper is slightly harder because it first needs to be scanned to an image, then processed using optical character recognition (OCR) technology, and only then converted to plain text. Adobe Scan is one example of an application that scans and recognizes text. These scanned documents are almost always stored in PDF format, as are most academic resources. This means that PDF is one of the most important and most overlooked impediments to accessibility, especially in academia, where it is the de facto standard.
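As a rough illustration of the rich-text case, the sketch below wraps the Pandoc command line from Python. The function names and the choice to write the output next to the source file with a .txt extension are assumptions made for this example. It also assumes Pandoc is installed and can read the input format.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_plain_command(source: str) -> list[str]:
    """Build a Pandoc command that writes `source` as plain text next to it."""
    # Hypothetical convention: same file name, with a .txt extension.
    target = str(Path(source).with_suffix(".txt"))
    return ["pandoc", source, "--to", "plain", "--output", target]


def convert_to_plain(source: str) -> str:
    """Run Pandoc and return the path of the plain-text file it wrote."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    command = pandoc_plain_command(source)
    subprocess.run(command, check=True)  # raises if Pandoc reports an error
    return command[-1]
```

Calling convert_to_plain("notes.docx") would, under these assumptions, produce notes.txt in the same folder. Pandoc infers the input format from the file extension, so formats it cannot read will still fail.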

PDF is a very flexible format that allows text to be stored in a multitude of different ways. One of the best explanations of this is by Prof. Brailsford from the University of Nottingham in this video:

PDF itself is an incredibly complicated format, outlined in Adobe's PDF specification. However, as long as the actual text data is stored in the document, it is possible to extract the text, making it far more accessible. PDFs typically do not store any information about how the different blocks of text on a page relate to one another. As a result, most text extraction solutions for PDF fail to remove unwanted text such as headers, footers, page numbers, tables, and diagrams, so the extracted text is not that useful without further processing. To solve this problem, I have started to build a tool in Python that converts PDFs to plain text. When this project is complete, it could be used in conjunction with other tools to create a program that can convert any text format to plain text. This could then be combined with text simplification, text to speech and other tools to vastly improve the accessibility options available to dyslexics and other individuals who have difficulty reading.
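To give a feel for the kind of further processing involved, here is a minimal sketch of one possible heuristic. It is not the tool described above. It assumes the text has already been extracted into one string per page, and it drops lines near the top or bottom of a page that recur on most pages once digits are ignored, which catches running headers and page numbers. The function names, the edge size and the threshold are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def normalise(line: str) -> str:
    """Collapse digits and whitespace so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_lines(pages: list[str], edge: int = 2, threshold: float = 0.6) -> list[str]:
    """Drop lines near the top or bottom of a page that recur on most pages."""
    split_pages = [[ln for ln in page.splitlines() if ln.strip()] for page in pages]

    # Count how many pages carry each normalised line in their edge region.
    counts = Counter()
    for lines in split_pages:
        edge_lines = lines[:edge] + lines[-edge:]
        counts.update({normalise(ln) for ln in edge_lines})

    # A line counts as a header or footer if it appears on enough pages.
    min_pages = max(2, int(threshold * len(pages)))
    repeated = {key for key, n in counts.items() if n >= min_pages}

    cleaned = []
    for lines in split_pages:
        kept = [
            ln for i, ln in enumerate(lines)
            if not ((i < edge or i >= len(lines) - edge) and normalise(ln) in repeated)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

A heuristic like this only handles repeated headers, footers and page numbers. Tables, diagrams and captions would need layout information that this sketch does not look at.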

I feel that an all-in-one solution that can extract text from any format would massively help in making reading more accessible to dyslexics. No such solution currently exists. However, there are no technical barriers preventing it; it just has not been built.