Max Friedrich

Text formats: a barrier to accessibility.

By: Max Friedrich


Humans are an incredibly diverse species. It is therefore very difficult to build a way of sharing information that everyone can use. As a result, people with unusual or very specific requirements tend to get ignored. People who are dyslexic are one example.

Text is, in reality, just an encoding scheme that allows us to store and share information scalably. Scalability and data encoding schemes are two areas in which Computer Science excels. It is therefore surprising that the solutions currently available, most of which use computers in some form, are so bad at helping dyslexics. Often, the solutions on offer are very expensive, badly implemented, unstable and ergonomically poor. This is definitely the case with the current offerings that try to help dyslexics read text.

Because text is stored in a multitude of different formats, it is very difficult to develop an accessibility solution that helps dyslexics to read text. The core problem in making text more accessible is generalising the accessibility solution to all text formats. If this problem is solved effectively, almost all other accessibility problems that pertain to reading can be quickly fixed.

In my opinion, there are four different groups of text storage formats: PDF, miscellaneous rich text formats, plain text and paper. I believe the main goal should be to convert all text to plain text so that it is as accessible as possible. It is currently very easy to convert miscellaneous rich text formats to plain text using software such as Pandoc. Text on paper is slightly harder because it first needs to be scanned to an image, processed using optical character recognition (OCR) technology and then finally converted to plain text, for example using Adobe Scan. These scanned documents are almost always stored in PDF format. Most academic resources are also stored in PDF format. This means that PDF is one of the most important and most overlooked impediments to accessibility, especially in academia, where it is the de facto standard.
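The routing between these four groups can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the extension lists and the function names `classify`, `pandoc_command` and `convert_rich_text` are hypothetical, and only the rich text route is actually wired up, using Pandoc's `plain` output format.

```python
import subprocess
from pathlib import Path

# Hypothetical extension lists; a real tool would need a fuller mapping.
RICH_TEXT = {".docx", ".odt", ".html", ".epub", ".rtf", ".md", ".tex"}
SCANNED_IMAGE = {".png", ".jpg", ".jpeg", ".tiff"}
PLAIN_TEXT = {".txt"}


def classify(path):
    """Place a file into one of the four groups by its extension."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "pdf"
    if ext in RICH_TEXT:
        return "rich"
    if ext in SCANNED_IMAGE:
        return "paper"  # a scan of a paper page, still needs OCR
    if ext in PLAIN_TEXT:
        return "plain"
    return "unknown"


def pandoc_command(src, dst):
    """Build the Pandoc command line that writes plain text to dst."""
    return ["pandoc", src, "--to", "plain", "--output", dst]


def convert_rich_text(src, dst):
    """Run Pandoc; assumes the pandoc executable is installed and on PATH."""
    subprocess.run(pandoc_command(src, dst), check=True)
```

Keeping the command construction separate from the call that runs it makes the routing easy to check without Pandoc installed. The `pdf` and `paper` routes are deliberately left unimplemented here, since they are the hard cases.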

PDF is a very flexible format that allows text to be stored in a multitude of different ways. One of the best explanations of this is by Prof. Brailsford from the University of Nottingham in this video:

The PDF format itself is incredibly complicated and is outlined by Adobe in the PDF specification. However, as long as the actual text data is stored in the document, it is possible to extract the text, making it far more accessible. PDFs do not store any information about how the different blocks of text relate to one another on a page. Most text extraction solutions for PDF fail to remove unwanted text such as headers, footers, page numbers, tables and diagrams. This means that the extracted text is not very useful without further processing. To solve this problem, I have started to build a tool in Python that allows PDFs to be converted to plain text. When this project is complete, it could be used in conjunction with other tools to create a program that can convert any text format to plain text. This could then be combined with text simplification, text to speech and other tools to vastly improve the accessibility options available to dyslexics and to individuals with other reading difficulties.
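To make the "further processing" step concrete, here is a minimal sketch of one possible heuristic for removing headers, footers and page numbers. It is not the tool described above. It assumes the text has already been extracted into a list of pages, each a list of lines, and the name `strip_page_furniture` and the `min_share` threshold are hypothetical. It does not attempt to handle tables or diagrams.

```python
import re
from collections import Counter

# Lines that contain nothing but a page number, e.g. "7", "Page 7" or "7 / 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(of|/)\s*\d+)?\s*$", re.IGNORECASE)


def _key(line):
    # Mask digits so "Chapter 2, p. 14" and "Chapter 2, p. 15" compare equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_page_furniture(pages, min_share=0.6):
    """Drop lines that repeat on most pages, plus bare page numbers.

    pages: list of pages, each page a list of text lines.
    min_share: assumed share of pages a line must appear on to count as
    a repeated header or footer.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line at most once per page.
        counts.update({_key(line) for line in page if line.strip()})

    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    return [
        [line for line in page
         if _key(line) not in repeated and not PAGE_NUMBER.match(line)]
        for page in pages
    ]
```

The heuristic is deliberately crude: a body line that happens to recur on most pages would also be removed, so a more careful version might only inspect the first and last few lines of each page.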

I feel that creating an all-in-one solution that can extract text from any format will massively help in making reading more accessible to dyslexics. This does not currently exist. However, there is no technical barrier to prevent it from happening; it just has not been built.