How to manually clean up weird punctuation and scrambled non-English characters on WordPress

Oh, WordPress. This was not how I had intended to work on paratext over the summer! It seems that our hosting platform of 10+ years had stored years and years’ worth of blog posts on this website using some kind of non-UTF-8 character set. After our hosting provider moved this website’s backend to a new server running MySQL 8 in October last year, a whole bunch of quotation marks, em dashes, en dashes, ellipses and non-English characters on this site were instantly transformed into ugly-looking codes. Unfortunately, as the source data itself was corrupted in this way, the only fix was to manually find and replace all the offending characters.

I put this off for a long time, but here are the two steps that helped me clean things up.

Step 1: Follow these instructions from Digwp.com to use SQL to find/replace the most commonly occurring weird characters. Garbled punctuation is the most offensive bit to most readers, and a database find/replace means not trawling through every post to fix it by hand. To do this you need access to your WordPress database, though I’m sure there are WordPress plugins out there to do the same if you’re not confident editing the database directly. The basic SQL query is here (using â€œ / “ as an example):

UPDATE wp_posts SET post_content = REPLACE(post_content, 'â€œ', '“');
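The odd-looking search string in that query is not random. A small Python sketch shows where it plausibly comes from, assuming (this is a guess based on the patterns, not something the hosting provider confirmed) that the posts were UTF-8 text whose bytes got read back as Windows-1252. The function name `garble` is made up for illustration.

```python
def garble(text: str) -> str:
    """Simulate the corruption: encode as UTF-8, then misread the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# A curly opening quote is three bytes in UTF-8 (E2 80 9C); read one byte
# at a time as Windows-1252, those bytes show up as three separate characters.
for good in ["“", "’", "…", "ü"]:
    print(repr(good), "->", repr(garble(good)))
```

Under that assumption, `garble("“")` gives `â€œ`, which is exactly the string the query searches for. A few characters (the closing curly quote is one) contain a byte that Windows-1252 leaves undefined, so this strict version raises an error for them.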

Step 2: Use the “Universal online Cyrillic decoder”, a web app that helps you work out what on earth you previously typed in non-English text (it works for other non-English languages too, not just Cyrillic). You paste in the garbled text, then scroll through a dropdown menu to discover what the source encoding was and how to convert it back into readable text. Once you find your original words, you can either type them back into your post or run a find and replace in the database (see step 1).

Once lost in translation: what is ÎµÎ½ Î§ÏÎ¹ÏƒÏ„Ï‰ ?

Turns out I had typed εν Χριστω (in Christ, unaccented)

It’s not absolutely perfect, but when you have multilingual posts like this one, it really feels like a miracle to be able to decode what you said!
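For the most common case, the decoding can also be sketched in a few lines of Python. This is not how the web app works internally (I don't know its implementation); it is a minimal sketch that assumes the garbled text is UTF-8 that was misread as Windows-1252. The names `to_original_bytes` and `repair` are hypothetical.

```python
def to_original_bytes(garbled: str) -> bytes:
    """Turn each garbled character back into the single byte it came from."""
    out = bytearray()
    for ch in garbled:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Windows-1252 leaves a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
            # undefined; they tend to survive as invisible control characters,
            # so pass those through unchanged.
            if ord(ch) < 256:
                out.append(ord(ch))
            else:
                raise ValueError(f"{ch!r} cannot come from a Windows-1252 byte")
    return bytes(out)


def repair(garbled: str) -> str:
    """Undo 'UTF-8 read as Windows-1252' for a piece of text."""
    return to_original_bytes(garbled).decode("utf-8")


print(repair("ÏƒÏ…Î½"))  # the Greek word from the table further down
```

If `repair` raises an error, the text was probably garbled some other way, or invisible characters were lost when copying it, and trying other encodings in the web app is the better route.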

For posterity, here’s a snapshot of some of the find/replaces that were involved (a mix of punctuation and non-English characters, and a window into the kinds of things we talk about on here!):

Before	After
â€œ	“
â€	”
â€™	’
â€˜	‘
â€”	—
â€“	–
â€¢	–
â€¦	…
Â,	,
ÎµÎ½ Î§ÏÎ¹ÏƒÏ„Ï‰	εν Χριστω
ÏƒÏ…Î½	συν
ÎµÌ“ÎºÏ„ÏÎµÌÏ†ÎµÏ„Îµ	ἐκτρέφετε
á¼ÎºÏ„ÏÎÏ†ÎµÎ¹	ἐκτρέφει
ÃŸ	ß
Ã¤	ä
Ã¼	ü
Ã¶	ö
ç‡’é´¨	燒鴨
å‰ç‡’	叉燒
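A table like the one above can be turned into UPDATE statements mechanically rather than typed out one by one. The sketch below is one hedged way to do it: `sql_literal` and `build_updates` are hypothetical helper names, and the defaults `wp_posts` / `post_content` are simply the table and column from the query in step 1. It sorts the pairs longest-first, on the assumption that a short pattern such as â€ could otherwise match the start of a longer one such as â€œ and spoil it.

```python
def sql_literal(value: str) -> str:
    """Quote a string for a MySQL statement, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def build_updates(pairs: dict[str, str],
                  table: str = "wp_posts",
                  column: str = "post_content") -> list[str]:
    """Build one REPLACE-based UPDATE per (garbled, fixed) pair, longest pattern first."""
    ordered = sorted(pairs.items(), key=lambda item: len(item[0]), reverse=True)
    return [
        f"UPDATE {table} SET {column} = "
        f"REPLACE({column}, {sql_literal(bad)}, {sql_literal(good)});"
        for bad, good in ordered
    ]


if __name__ == "__main__":
    # A few rows copied from the table; extend as needed.
    pairs = {"Ã¼": "ü", "â€œ": "“", "â€¦": "…"}
    for statement in build_updates(pairs):
        print(statement)
```

Building SQL by pasting strings together is only reasonable here because the pairs are a short, hand-written list; the output is meant to be read over and then run by hand, on a backed-up database.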

Lesson learned: don’t let your hosting provider upgrade MySQL on their servers without a backup! But if it’s too late, there’s always find and replace, and pretending you discovered the Rosetta Stone for the internet.
