Automatic metrics
Machine-translated content is compared with human translations by measuring their similarity with the BLEU score metric.
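To make the idea of "measuring similarity" more concrete, here is a minimal sketch of a BLEU-style score in Python. It is not the official BLEU implementation: it assumes a single reference translation, whitespace tokenization, and only 1-grams and 2-grams, whereas standard BLEU is usually computed over a whole corpus with up to 4-grams and possibly several references. The function names (`ngrams`, `modified_precision`, `simple_bleu`) are made up for this illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference.

    Each n-gram is counted at most as often as it occurs in the
    reference ("clipping"), so repeating a word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU-style score (illustrative assumption).

    Geometric mean of the 1..max_n modified precisions, multiplied by a
    brevity penalty that punishes candidates shorter than the reference.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    precisions = [modified_precision(cand, ref, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_mean)


machine = "the cat sat on a mat"
human = "the cat sat on the mat"
print(simple_bleu(machine, human))
```

Note that the sketch only counts exact word matches, so a perfectly good synonym in the machine translation scores the same as a wrong word.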
Criticism
It ignores the differences between words, and it works at a very local level (I guess it is sentence-based).
Manual evaluation
Done by asking human evaluators, sometimes recruited through crowd-sourcing platforms such as Amazon Mechanical Turk.
Conclusions
Research is continuing in both directions.
You can read the full article here.

Hi, I received a very interesting comment on LinkedIn… copy and paste the links to learn more.
Kirti Vashee • There is more discussion on this at:
http://kv-emptypages.blogspot.com/2010/03/problems-with-bleu-and-new-translation.html
and
http://kv-emptypages.blogspot.com/2012/01/short-guide-to-measuring-and-comparing.html
And comments on MT quality in terms of productivity implications:
http://kv-emptypages.blogspot.com/2012/03/exploring-issues-related-to-post.html
I have to post another very interesting link, received through LinkedIn from Sandra Williams. I will just copy and paste the message:
Linda, There was a paper on human evaluation of MT output at the 2012 NAACL workshop on Predicting and improving text readability for target reader populations (PITR2012):
Tucker Maney, Linda Sibert, Dennis Perzanowski, Kalyan Gupta and Astrid Schmidt-Nielsen (2012) Toward Determining the Comprehensibility of Machine Translations.
See http://wing.comp.nus.edu.sg/~antho/W/W12/#2200