
Character Encoding Recognition Made Easy
By: Darrell Burk

  

So you're writing the mother of all text editors, and your rich editing features are working beautifully. Then you hit a serious snag as you start the code that reads and decodes existing files: character sets. How can your program tell which character encoding should be used to properly read each file?

Or perhaps you're writing a custom program to convert to Unicode and archive thousands of text documents for your employer. The original documents are saved in many different encodings, and there is no easy way to correctly identify the character set for each one.

You do a little research and find that byte order marks (BOMs) can help you identify some of the UTF encodings, and you learn some tricks for recognizing when a file might use the US-ASCII encoding. But these tricks aren't guaranteed; in fact, they'll probably fail as often as they work. And they don't help you at all with most of the two hundred or so other possible encodings.
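To make those tricks concrete, here is a minimal sketch of BOM sniffing and an ASCII check in plain Java. The class and method names (BomSniffer, detectBom, looksLikeAscii) are made up for illustration and are not part of any library mentioned in this article. The sketch also shows why the tricks fall short: many files carry no BOM at all, and a file containing only bytes below 0x80 is just as valid in UTF-8, ISO-8859-1, and plenty of other encodings as it is in US-ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of any real library.
final class BomSniffer {

    /** Returns the encoding name implied by a leading BOM, or empty if no BOM is present. */
    static Optional<String> detectBom(byte[] b) {
        // Test the 4-byte UTF-32 marks first: the UTF-32LE mark (FF FE 00 00)
        // begins with the UTF-16LE mark (FF FE), so order matters.
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(b, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(b, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty();
    }

    /** True if every byte is below 0x80, meaning the data is consistent with US-ASCII. */
    static boolean looksLikeAscii(byte[] b) {
        for (byte x : b) {
            if (x < 0) return false; // Java bytes are signed, so 0x80..0xFF are negative
        }
        return true;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detectBom(withBom).orElse("no BOM"));
        System.out.println(looksLikeAscii("plain text".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Even a match is only a hint: the bytes FF FE 00 00 could be a UTF-32LE mark, or a UTF-16LE mark followed by a NUL character.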

That just isn't good enough for your application. You need software that can accurately recognize the character encoding of a text file no matter what it is. As you begin to discover the wide array of character sets and encoding strategies and contemplate the complexities involved, you conclude you'd really rather not write it.

You need EncodingSleuth Text.

EncodingSleuth Text is a powerful Java library designed specifically with your application in mind. It examines files and byte streams to determine whether they contain encoded text, and identifies the character set most likely used to encode them.

EncodingSleuth Text uses several different statistical analysis techniques, called detectors, to analyze each possible character set that might be used to decode a file, and to score each one so that the correct character set obtains the highest score. It is configurable: you can selectively enable or disable each of the detectors to tailor its operation to your specific needs. It is also extensible: you can provide your own detector implementations should the need arise.
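To picture the general idea of scoring candidates with pluggable detectors, here is a rough sketch built only on the standard java.nio.charset classes. It is not the EncodingSleuth Text API: the Detector interface, both detector classes, and CharsetScorer are hypothetical names invented for this illustration, and the two checks are far cruder than real statistical analysis. Each detector returns a score between 0 and 1 for a candidate character set, the scores are summed, and the highest total wins, with ties going to the candidate listed first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical scorer: sums detector scores and keeps the best candidate. */
final class CharsetScorer {
    private final List<Detector> detectors;

    CharsetScorer(List<Detector> detectors) {
        this.detectors = detectors;
    }

    Charset best(byte[] data, List<Charset> candidates) {
        Charset best = null;
        double bestScore = -1.0;
        for (Charset candidate : candidates) {
            double total = 0.0;
            for (Detector d : detectors) {
                total += d.score(data, candidate);
            }
            if (total > bestScore) { // strict '>' means ties go to the earlier candidate
                bestScore = total;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        CharsetScorer scorer = new CharsetScorer(
                List.of(new StrictDecodeDetector(), new PrintableRatioDetector()));
        List<Charset> candidates = List.of(
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scorer.best(latin1, candidates));
    }
}

/** Hypothetical detector contract; an assumption made for this sketch. */
interface Detector {
    /** Returns a score in [0, 1]; higher means the candidate fits the data better. */
    double score(byte[] data, Charset candidate);
}

/** Scores 1.0 if the bytes decode without any error under the candidate, else 0.0. */
final class StrictDecodeDetector implements Detector {
    @Override
    public double score(byte[] data, Charset candidate) {
        try {
            candidate.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return 1.0;
        } catch (CharacterCodingException e) {
            return 0.0;
        }
    }
}

/** Scores by the fraction of decoded characters that are neither U+FFFD nor non-whitespace controls. */
final class PrintableRatioDetector implements Detector {
    @Override
    public double score(byte[] data, Charset candidate) {
        // new String(...) decodes leniently, replacing bad input with U+FFFD.
        String text = new String(data, candidate);
        if (text.isEmpty()) return 0.0;
        int good = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean control = Character.isISOControl(ch) && !Character.isWhitespace(ch);
            if (ch != '\uFFFD' && !control) good++;
        }
        return (double) good / text.length();
    }
}
```

Notice how weak these two checks are on their own: ISO-8859-1 decodes any byte sequence without error, so it passes the strict test for every file. That is exactly the gap that purpose-built statistical detectors are meant to close.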

Its licensing options allow royalty-free redistribution within your applications, and even deployment within server applications, at a price that's a fraction of the cost of developing your own encoding recognition technology. EncodingSleuth Text offers a complete and robust answer to your need.

You can download EncodingSleuth Text, request a free full-featured trial license, and peruse the documentation at http://www.encodingsleuth.com.

Darrell is president, developer, and most everything else at SynergiSystems, Inc. He launched SynergiSystems in 2007 in order to create software to make life easier for software developers.

Article Source: http://www.ArticleBiz.com

Copyright © 2009 by ArticleBiz.com. All rights reserved.
