Development of Process for Generating Thai Audio Image Description

Wasin Pirom

Abstract

The motivation for this research arose from the need to develop a tool that helps visually impaired and blind people improve their quality of life by giving them access to visual information and enabling safer travel. A process for generating Thai audio image descriptions was therefore developed. The system detects objects in an image, classifies them, and generates Thai-language descriptions directly, without translation from English. The DEtection TRansformer (DETR) is applied to detect objects in the image accurately and quickly; the Thai sentences or phrases are then composed using the Thai Text Generator with the WangchanBERTa Thai language model. The important features of the image description in this research are the ability to state the number of objects of the same kind and to select appropriate Thai noun classifiers. In the developed system, objects within the detected image are automatically cropped into separate images, classified, and counted per category. The suitable noun classifier is chosen using masked token prediction. The system can reclassify using zero-shot learning when new kinds of images are added, which makes it more flexible in use and saves a significant amount of time in building an image database. The Thai image description includes the type, number, and Thai noun classifier of the objects in the image; a generated sentence describing the image; and the predicted shooting location. Finally, the Thai sentences or phrases describing the captured image are converted to Thai speech using the VAJA Text-to-Speech engine integrated with the image detection system, so that visually impaired and blind people can recognize the details of the scene in front of them. The results showed good performance of the developed process: an input image file can be transformed into an image description and a Thai audio description.
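As a rough illustration of the counting-and-composition step described in the abstract, the sketch below counts detected objects per class and assembles a "noun + count + classifier" phrase in Thai. Note the assumptions: the detection labels are hardcoded rather than produced by DETR, and the classifier lookup table is a hand-written stand-in for the paper's masked token prediction over WangchanBERTa.

```python
from collections import Counter

# Hand-picked Thai noun classifiers. The paper selects classifiers with
# masked token prediction over WangchanBERTa; this small lookup table is
# only an illustrative stand-in for that step.
CLASSIFIERS = {
    "person": "คน",
    "dog": "ตัว",
    "car": "คัน",
}

# English detection labels mapped to Thai nouns (illustrative subset).
THAI_NOUNS = {
    "person": "คน",
    "dog": "สุนัข",
    "car": "รถยนต์",
}

def describe(detected_labels):
    """Count detected objects per class and compose a Thai phrase of the
    form '<noun> <count> <classifier>' for each class present."""
    counts = Counter(detected_labels)
    parts = [
        f"{THAI_NOUNS[label]} {n} {CLASSIFIERS[label]}"
        for label, n in counts.items()
    ]
    # Join the per-class phrases with the Thai word for 'and'.
    return " และ ".join(parts)

# Example: labels as an object detector might emit them for a street scene.
print(describe(["person", "person", "dog"]))
# → คน 2 คน และ สุนัข 1 ตัว
```

In the full pipeline this phrase would then be expanded into a sentence by the Thai Text Generator and spoken by the VAJA engine.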

Article Details

How to Cite
Pirom, W. (2024). Development of Process for Generating Thai Audio Image Description. Journal of Advanced Development in Engineering and Science, 13(38), 45–60. Retrieved from https://ph03.tci-thaijo.org/index.php/pitjournal/article/view/595
Section
Research Article

References

Gitari, M. (2023). The 7 Best Apps to Help People with Visual Impairments Recognize Objects. Available from https://www.pathstoliteracy.org/resource/7-best-apps-help-people-visual-impairments-recognize-objects. Accessed date: 16 April 2023.

TapTapSee. (2023). TapTapSee: Assistive Technology for the Blind and Visually Impaired. Available from https://taptapseeapp.com. Accessed date: 16 April 2023.

Aipoly. (2017). Aipoly Vision for Android. Available from https://download.cnet.com/Aipoly-Vision/3000-20432_4-77580945.html. Accessed date: 16 April 2023.

D4D. (2021). Digital Service for Disability. Available from https://d4d.onde.go.th/. Accessed date: 16 April 2023.

TAB2Read. (2023). TAB2Read. Available from https://d4d.onde.go.th/app-portal/47. Accessed date: 16 April 2023.

Navilens. (2023). Navilens. Available from https://www.navilens.com/en. Accessed date: 16 April 2023.

Pirom, W. (2022). Object Detection and Position using CLIP with Thai Voice Command for Thai Visually Impaired. In 37th International Technical Conference on Circuits/Systems, Computers and Communications (p.391-394). 5-8 July 2022, Phuket, Thailand.

Nimmolrat, A., et al. (2021). Pharmaceutical mobile application for visually-impaired people in Thailand: development and implementation. BMC Medical Informatics and Decision Making, 21, 217.

Chen, J., et al. (2015). Déjà image-captions: A corpus of expressive descriptions in repetition. In The 2015 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (p.504-514). 31 May - 5 June 2015, Denver, Colorado, USA.

Yagcioglu, S., et al. (2015). A Distributed Representation Based Query Expansion Approach for Image Captioning. In The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (p.106-111). 26-31 July, 2015, Beijing, China.

Kuznetsova, P., et al. (2012). Collective generation of natural image descriptions. In The 50th Annual Meeting of the Association for Computational Linguistics (p.359-368). 8-14 July, 2012, Jeju, Korea.

Li, S., et al. (2011). Composing simple image descriptions using web-scale n-grams. In The 15th Conference on Computational Natural Language Learning (p.220-228). 23-24 June, 2011, Portland, Oregon, USA.

Yang, Y., et al. (2011). Corpus-guided sentence generation of natural images. In The 2011 Conference on Empirical Methods in Natural Language Processing (p.444-454). 27-31 July, 2011, Edinburgh, Scotland.

Elliott, D. & de Vries, A. P. (2015). Describing images using inferred visual dependency representations. In The 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference on Natural Language Processing (p.42-52). 26-31 July, 2015, Beijing, China.

Elliott, D. & Keller, F. (2013). Image Description using Visual Dependency Representations. In 2013 Conference on Empirical Methods in Natural Language Processing (p.1292-1302). 18-21 October, 2013, Seattle, Washington, USA.

Hodosh, M., et al. (2013). Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research, 47, 853-899.

Lebret, R., et al. (2015). Phrase-based image captioning. In International Conference on Machine Learning (p.2085-2094). 6-11 July 2015, Lille, France.

Guadarrama, S., et al. (2013). Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (p.2712-2719). 1-8 December, 2013, Sydney, Australia.

Khuphiran, P., et al. (2019). Thai Scene Graph Generation from Images and Applications. In 23rd International Computer Science and Engineering Conference (p.361-365). 30 October – 1 November 2019, Phuket, Thailand.

Mookdarsanit, P. & Mookdarsanit, L. (2020). Thai-IC: Thai Image Captioning based on CNN-RNN Architecture. International Journal of Applied Computer Technology and Information Systems, 10, 40-45.

He, S., et al. (2020). Image Captioning through Image Transformer. In 16th Asian Conference on Computer Vision (p.153-169). 4-8 December, 2022, Macao, China.

Kulkarni, G., et al. (2013). Baby Talk: Understanding and Generating Simple Image Descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2891-2903.

Elliott, D. & Keller, F. (2014). Comparing Automatic Evaluation Measures for Image Description. In The 52nd Annual Meeting of the Association for Computational Linguistics (p.452-457). 22-27 June 2014, Baltimore, Maryland.

Wikipedia. (2023). Classifier (linguistics). Available from https://en.wikipedia.org/wiki/Classifier_(linguistics). Accessed date: 16 April 2023.

VISTEC-depa Thailand AI Research Institute. (2021). WangchanBERTa: Pre-trained Thai Language Model. Available from https://airesearch.in.th/releases/wangchanberta-pre-trained-thai-language-model. Accessed date: 16 April 2023.

Lowphansirikul, L., et al. (2021). WangchanBERTa: Pre-training transformer-based Thai Language Models. Available from https://arxiv.org/abs/2101.09635. Accessed date: 16 April 2023.

Carion, N., et al. (2020). End-to-end object detection with transformers. Available from https://arxiv.org/abs/2005.12872. Accessed date: 16 April 2023.

Phatthiyaphaibun, W. (2020). TTG: Thai Text Generator. Available from https://colab.research.google.com/drive/1X6D8J0sWNi8UgJi7Hk5YL4FqepZ7laxS?usp=sharing. Accessed date: 16 April 2023.

Rezaei, M. & Shahidi, M. (2020). Zero-Shot Learning and its Applications from Autonomous Vehicles to COVID-19 Diagnosis: A Review. Intelligence-Based Medicine, 3-4, 100005.

GitHub. (2021). Thai Text Generator. Available from https://github.com/PyThaiNLP/Thai-Text-Generator. Accessed date: 9 September 2022.

Szymański, G. & Ciota, Z. (2002). Hidden Markov Models Suitable for Text Generation. Available from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.335.93. Accessed date: 16 April 2023.

Department of Statistics. (2014). COURSE NOTES STATS 325 Stochastic Processes. Auckland: University of Auckland.

Pushp, P. K. & Srivastava, M. M. (2017). Train once, test anywhere: Zero-shot learning for text classification. Available from https://arxiv.org/abs/1712.05972. Accessed date: 16 April 2023.

Puri, R. & Catanzaro, B. (2019). Zero-shot text classification with generative language models. Available from https://arxiv.org/abs/1912.10165. Accessed date: 16 April 2023.

NECTEC. (2016). VAJA Text-to-Speech Engine. Available from https://www.nectec.or.th/innovation/innovation-mobile-application/vaja.html. Accessed date: 30 April 2023. (in Thai)