
How to Do OCR Text Recognition on Mac - Supports M1 Mac

1. Background

Recently I wanted to build something fun with OCR (optical character recognition). I know Baidu OCR offers a mature API, but I still wanted to build my own with Tess4J. After all, life is about tinkering.

Besides, Baidu OCR isn't completely free: there's a limit of 1,000 calls per month.

1.1 Tess4J

Official website: Tess4J

Tess4J is a Java JNA wrapper for the Tesseract OCR API. It allows us to easily call Tesseract for text recognition. Tesseract is an optical character recognition engine. It supports multiple operating systems and is free software under the Apache license, sponsored by Google. Tesseract is considered one of the most accurate open-source optical character recognition engines.

2. Installing Tesseract

2.1 Mac Environment

brew install tesseract

The download may fail because of the Great Firewall. If it does, switch Homebrew to a mirror source as follows:

2.2 Changing Mac Brew Mirror Source

# Replace brew.git:
$ cd "$(brew --repo)"
# Tsinghua University:
$ git remote set-url origin https://mirrors.tuna.tsinghua.edu.cn/git/homebrew/brew.git

# Replace homebrew-core.git:
$ cd "$(brew --repo)/Library/Taps/homebrew/homebrew-core"
# Tsinghua University:
$ git remote set-url origin https://mirrors.tuna.tsinghua.edu.cn/git/homebrew/homebrew-core.git

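# Replace homebrew-bottles:
# Tsinghua University:
# (zsh is the default shell on recent macOS; if you use zsh, write to ~/.zshrc instead of ~/.bash_profile)
$ echo 'export HOMEBREW_BOTTLE_DOMAIN=https://mirrors.tuna.tsinghua.edu.cn/homebrew-bottles' >> ~/.bash_profile
$ source ~/.bash_profile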

# Apply changes:
$ brew update

After the installation completes, run tesseract -v to print the version information.

2.4 Getting libtesseract.dylib Information

Then run brew list tesseract and note the path of the libtesseract dylib file; you'll need it for the demo later:

asher@AsherdeMBP homebrew-core % brew list tesseract
/opt/homebrew/Cellar/tesseract/5.0.1/bin/tesseract
/opt/homebrew/Cellar/tesseract/5.0.1/include/tesseract/ (12 files)
/opt/homebrew/Cellar/tesseract/5.0.1/lib/libtesseract.5.dylib
/opt/homebrew/Cellar/tesseract/5.0.1/lib/pkgconfig/tesseract.pc
/opt/homebrew/Cellar/tesseract/5.0.1/lib/ (2 other files)
/opt/homebrew/Cellar/tesseract/5.0.1/share/tessdata/ (35 files)
asher@AsherdeMBP homebrew-core %
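If you'd rather not pick the path out by eye, here is a minimal sketch of how the output above could be filtered in Java. The class and method names (BrewListParser, findTesseractDylib) are made up for illustration and are not part of Tess4J or Homebrew; the sketch assumes you have already captured the lines of brew list tesseract yourself.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: picks the libtesseract dylib out of `brew list tesseract` output. */
final class BrewListParser {

    /**
     * Returns the first line that names a libtesseract*.dylib file, if any.
     * Summary lines such as ".../lib/ (2 other files)" are skipped because
     * they do not end in ".dylib".
     */
    static Optional<String> findTesseractDylib(List<String> brewListLines) {
        return brewListLines.stream()
                .map(String::trim)
                .filter(line -> line.endsWith(".dylib"))
                .filter(line -> {
                    // Look only at the file name, not at the directories before it.
                    String fileName = line.substring(line.lastIndexOf('/') + 1);
                    return fileName.startsWith("libtesseract");
                })
                .findFirst();
    }

    public static void main(String[] args) {
        // Sample lines copied from the output shown above.
        List<String> output = List.of(
                "/opt/homebrew/Cellar/tesseract/5.0.1/bin/tesseract",
                "/opt/homebrew/Cellar/tesseract/5.0.1/lib/libtesseract.5.dylib",
                "/opt/homebrew/Cellar/tesseract/5.0.1/lib/ (2 other files)");
        System.out.println(findTesseractDylib(output).orElse("libtesseract dylib not found"));
    }
}
```

The version number in the path will differ on your machine, which is why the sketch matches on the file name rather than on a fixed path.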

2.5 Downloading tessdata Files

Open the tessdata repository on GitHub and git clone it locally. The repository is quite large, so the download may take a while.

3. Creating a Demo Project for Testing

Here I've prepared a Chinese image - feel free to use it:

chitemp.jpg

3.1 Create a New Maven Project

image.png

3.2 Add the Tess4J Dependency

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.1.1</version>
</dependency>

3.3 Copy the libtesseract.dylib File

Go back to step 2.4 and open the lib directory listed there. Mine, for example, is:

/opt/homebrew/Cellar/tesseract/5.0.1/lib

Copy the libtesseract.5.dylib file to the project's resources directory, then rename it to libtesseract.dylib.

Why not directly copy libtesseract.dylib? Because that's a symbolic link, like a shortcut in Windows.
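The copy-and-rename step can also be done in code. This is only a sketch under a few assumptions: DylibCopier and copyResolved are hypothetical names, the source path is the one from my machine, and src/main/resources is assumed to be the project's resources directory (the usual Maven layout). It resolves the symbolic link first, so what lands in the project is the real library file under the name libtesseract.dylib.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: copies the real file behind a (possibly symbolic) path. */
final class DylibCopier {

    /** Resolves symbolic links in source, then copies the real file to target. */
    static Path copyResolved(Path source, Path target) {
        try {
            Path real = source.toRealPath(); // follows symbolic links to the actual file
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent); // make sure the resources directory exists
            }
            return Files.copy(real, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Example paths only: adjust them to your own `brew list tesseract` output and project.
        Path link = Paths.get("/opt/homebrew/Cellar/tesseract/5.0.1/lib/libtesseract.dylib");
        Path target = Paths.get("src/main/resources/libtesseract.dylib");
        System.out.println("Symbolic link? " + Files.isSymbolicLink(link));
        System.out.println("Copied to " + copyResolved(link, target));
    }
}
```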


3.4 Writing Test Code

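import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrDemo {
    public static void main(String[] args) throws TesseractException {
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        // Path to the tessdata directory cloned in step 2.5
        instance.setDatapath("/Users/asher/gitWorkspace/tessdata");
        instance.setLanguage("chi_sim"); // Simplified Chinese
        String result = instance.doOCR(new File("/Users/asher/Desktop/temp/chi_temp.jpg"));
        System.out.println(result);
    }
}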

Code explanation:

  • instance.setDatapath(String): sets the path to the tessdata directory downloaded in step 2.5, which contains the Simplified Chinese training data
  • instance.setLanguage(String): sets the language to recognize, using a name from the Library Name - Language Table at the end of this post

The text is recognized correctly.

4. Troubleshooting Some Errors

4.1 Language Not Found

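Make sure the language you pass to setLanguage matches a language pack in your tessdata directory (for example, chi_sim needs chi_sim.traineddata), and download the pack if it is missing: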

instance.setLanguage("chi_sim");
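To catch this before calling doOCR, you could check the tessdata directory up front. Below is a minimal sketch using only the JDK; TessdataCheck and missingLanguages are hypothetical names, not part of Tess4J. It assumes each language pack is a file named <language>.traineddata directly inside the data path, and that several languages may be joined with "+" (for example chi_sim+eng).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for the tessdata directory. */
final class TessdataCheck {

    /** Returns the language codes whose <code>.traineddata file is missing from dataPath. */
    static List<String> missingLanguages(Path dataPath, String languages) {
        List<String> missing = new ArrayList<>();
        for (String code : languages.split("\\+")) { // "chi_sim+eng" -> chi_sim, eng
            if (!Files.isRegularFile(dataPath.resolve(code + ".traineddata"))) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Same example path as the test code above; replace it with your own.
        Path dataPath = Paths.get("/Users/asher/gitWorkspace/tessdata");
        List<String> missing = missingLanguages(dataPath, "chi_sim");
        System.out.println(missing.isEmpty()
                ? "All language packs found"
                : "Missing language packs: " + missing);
    }
}
```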

4.2 mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')

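The libtesseract.dylib being loaded was built for Intel (x86_64), but the JVM on an M1 Mac is running as arm64. Replace it with the arm64 library that Homebrew installed, as described in step 3.3.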
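If you're not sure which architecture a given dylib was built for, you can peek at its Mach-O header. The following is a rough sketch, not a full Mach-O parser: MachOArch and describe are hypothetical names, and it only distinguishes thin 64-bit x86_64 and arm64 files and universal (fat) binaries. It also prints the JVM's os.arch, which I'd expect to be aarch64 on a native Apple Silicon JVM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: reports the architecture recorded in a Mach-O header. */
final class MachOArch {
    private static final int MH_MAGIC_64 = 0xFEEDFACF;     // thin 64-bit Mach-O file
    private static final int FAT_MAGIC = 0xCAFEBABE;       // universal binary (big-endian header)
    private static final int CPU_TYPE_X86_64 = 0x01000007;
    private static final int CPU_TYPE_ARM64 = 0x0100000C;

    /** Expects at least the first 8 bytes of the file. */
    static String describe(byte[] header) {
        if (header.length < 8) {
            return "unknown";
        }
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        if (buf.getInt(0) == FAT_MAGIC) {
            return "universal";
        }
        buf.order(ByteOrder.LITTLE_ENDIAN);       // x86_64 and arm64 files are little-endian
        if (buf.getInt(0) != MH_MAGIC_64) {
            return "unknown";
        }
        switch (buf.getInt(4)) {                  // cputype field follows the magic number
            case CPU_TYPE_X86_64: return "x86_64";
            case CPU_TYPE_ARM64:  return "arm64";
            default:              return "unknown";
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MachOArch <path-to-dylib>");
            return;
        }
        byte[] header = new byte[8];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            in.readNBytes(header, 0, header.length);
        }
        System.out.println("library architecture: " + describe(header));
        System.out.println("JVM os.arch: " + System.getProperty("os.arch"));
    }
}
```

If the library reports x86_64 while the JVM reports aarch64, you have the mismatch from the error message above.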


Library Name - Language Table

Library Name	Language
afr Afrikaans
amh Amharic
ara Arabic
asm Assamese
aze Azerbaijani
aze_cyrl Azerbaijani - Cyrillic
bel Belarusian
ben Bengali
bod Tibetan
bos Bosnian
bul Bulgarian
cat Catalan; Valencian
ceb Cebuano
ces Czech
chi_sim Chinese - Simplified
chi_tra Chinese - Traditional
chr Cherokee
cym Welsh
dan Danish
dan_frak Danish - Fraktur
deu German
deu_frak German - Fraktur
dzo Dzongkha
ell Greek, Modern (1453-)
eng English
enm English, Middle (1100-1500)
epo Esperanto
equ Math / equation detection module
est Estonian
eus Basque
fas Persian
fin Finnish
fra French
frk Frankish
frm French, Middle (ca.1400-1600)
gle Irish
glg Galician
grc Greek, Ancient (to 1453)
guj Gujarati
hat Haitian; Haitian Creole
heb Hebrew
hin Hindi
hrv Croatian
hun Hungarian
iku Inuktitut
ind Indonesian
isl Icelandic
ita Italian
ita_old Italian - Old
jav Javanese
jpn Japanese
kan Kannada
kat Georgian
kat_old Georgian - Old
kaz Kazakh
khm Central Khmer
kir Kirghiz; Kyrgyz
kor Korean
kur Kurdish
lao Lao
lat Latin
lav Latvian
lit Lithuanian
mal Malayalam
mar Marathi
mkd Macedonian
mlt Maltese
msa Malay
mya Burmese
nep Nepali
nld Dutch; Flemish
nor Norwegian
ori Oriya
osd Orientation and script detection module
pan Panjabi; Punjabi
pol Polish
por Portuguese
pus Pushto; Pashto
ron Romanian; Moldavian; Moldovan
rus Russian
san Sanskrit
sin Sinhala; Sinhalese
slk Slovak
slk_frak Slovak - Fraktur
slv Slovenian
spa Spanish; Castilian
spa_old Spanish; Castilian - Old
sqi Albanian
srp Serbian
srp_latn Serbian - Latin
swa Swahili
swe Swedish
syr Syriac
tam Tamil
tel Telugu
tgk Tajik
tgl Tagalog
tha Thai
tir Tigrinya
tur Turkish
uig Uighur; Uyghur
ukr Ukrainian
urd Urdu
uzb Uzbek
uzb_cyrl Uzbek - Cyrillic
vie Vietnamese
yid Yiddish