Thank me later!

The Hadoop filesystem interface has a nice little secret that isn't directly advertised anywhere in the documentation. The clean-looking FileSystem#get API actually caches FileSystem objects and returns a cached instance if one is available. This makes perfect sense as long as your filesystem is thread safe and can be shared between multiple mappers/reducers. But when you have a custom FileSystem implementation that can't safely be reused across threads, you can end up in debugging hell. Unfortunately, I had to face that ordeal while implementing and using a custom SFTP filesystem. I started getting JVM crashes, "unable to create new native thread" OutOfMemoryErrors, and so on. After a day of scrounging through code and pulling my hair out, I figured out that it was the filesystem cache doing this, and I had to set

fs.FILESYSTEM_SCHEME.impl.disable.cache=true

in the Hadoop configuration (where FILESYSTEM_SCHEME is the URI scheme of your filesystem, e.g. sftp). And we lived happily ever after. Phew!
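The caching behaviour can be sketched in plain Java. This is an illustrative model, not Hadoop's actual internals: the class and method names below are made up, and the real cache key also includes the URI authority and the current user, not just the scheme. It shows why two callers silently share one FileSystem object unless the disable flag is set.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the cache inside FileSystem#get
// (hypothetical names; real Hadoop keys the cache on
// scheme + authority + UserGroupInformation).
class FileSystemCacheDemo {

    // Stand-in for an org.apache.hadoop.fs.FileSystem instance.
    static class FakeFileSystem {
        final String scheme;
        FakeFileSystem(String scheme) { this.scheme = scheme; }
    }

    private final Map<String, FakeFileSystem> cache = new HashMap<>();
    private final Map<String, String> conf = new HashMap<>();

    void setConf(String key, String value) {
        conf.put(key, value);
    }

    FakeFileSystem get(String scheme) {
        // Mirrors the real check on fs.<scheme>.impl.disable.cache
        boolean disabled = "true".equals(
                conf.get("fs." + scheme + ".impl.disable.cache"));
        if (disabled) {
            // Cache bypassed: every caller gets a fresh instance.
            return new FakeFileSystem(scheme);
        }
        // Default: everyone asking for this scheme shares one object.
        return cache.computeIfAbsent(scheme, FakeFileSystem::new);
    }
}
```

With caching on, two `get("sftp")` calls return the very same object, so a non-thread-safe implementation gets hammered from multiple threads; with the disable flag set, each call constructs a new instance, which is exactly what fixed the crashes above.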

Published at: Wednesday, Feb 24, 2016 at 23:13 | Tags: hacking, java, hadoop
Azhagu Selvan SP