Hi there, I have a lot of images (in the low millions) that I need to classify. I am using Spark and have managed to read all of the images into one big RDD in the form (filename1, content1), (filename2, content2), ...:
images = sc.wholeTextFiles("hdfs:///user/myuser/images/image/00*")
However, I am confused about what to do with the unicode representation of each image.
Here is an example of one image/file:
(u'hdfs://NameService/user/myuser/images/image/00product.jpg', u'\ufffd\ufffd\ufffd\ufffd\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00\ufffd\ufffd\x01\x1eExif\x00\x00II*\x00\x08\x00\x00\x00\x08\x00\x12\x01\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00\x1a\x01\x05\x00\x01\x00\x00\x00n\x00\x00\x00\x1b\x01\x05\x00\x01\x00\x00\x00v\x00\x00\x00(\x01\x03\x00\x01\x00\x00\x00\x02\x00\x00\x001\x01\x02\x00\x0b\x00\x00\x00~\x00\x00\x002\x01\x02\x00\x14\x00\x00\x00\ufffd\x00\x00\x00\x13\x02\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00i\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00`\x00\x00\x00\x01\x00\x00\x00GIMP 2.8.2\x00\x002013:07:29 10:41:35\x00\x07\x00\x00\ufffd\x07\x00\x04\x00\x00\x000220\ufffd\ufffd\x02\x00\x04\x00\x00\x00407\x00\x00\ufffd\x07\x00\x04\x00\x00\x000100\x01\ufffd\x03\x00\x01\x00\x00\x00\ufffd\ufffd\x00\x00\x02\ufffd\x04\x00\x01\x00\x00\x00\x04\x04\x00\x00\x03\ufffd\x04\x00\x01\x00\x00\x00X\x01\x00\x00\x05\ufffd\x04\x00\x01\x00\x00\x00\ufffd\x00\x00\x00\x00\x00\x00\x00\x02\x00\x01\x00\x02\x00\x04\x00\x00\x00R98\x00\x02\x00\x07\x00\x04\x00\x00\x000100\x00\x00\x00\x00\ufffd\ufffd\x04_http://ns.adobe.com/xap/1.0/\x00<?xpacket begin='\ufeff' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description xmlns:exif='http://ns.adobe.com/exif/1.0/'>
<exif:Orientation>Top-left</exif:Orientation>
<exif:XResolution>96</exif:XResolution>
<exif:YResolution>96</exif:YResolution>
<exif:ResolutionUnit>Inch</exif:ResolutionUnit>
<exif:Software>ACD Systems Digital Imaging</exif:Software>
<exif:DateTime>2013:07:29 10:37:00</exif:DateTime>
<exif:YCbCrPositioning>Centered</exif:YCbCrPositioning>
<exif:ExifVersion>Exif Version 2.2</exif:ExifVersion>
<exif:SubsecTime>407</exif:SubsecTime>
<exif:FlashPixVersion>FlashPix Version 1.0</exif:FlashPixVersion>
<exif:ColorSpace>Uncalibrated</exif:ColorSpace>
Looking closer, some of the characters actually look like metadata, for example:
...
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description xmlns:exif='http://ns.adobe.com/exif/1.0/'>
<exif:Orientation>Top-left</exif:Orientation>
<exif:XResolution>96</exif:XResolution>
<exif:YResolution>96</exif:YResolution>
...
My previous experience is with the scipy package and related functions like 'imread', where the input is usually a filename. Now I am lost as to what this unicode content means and how I can transform it into a format that I am familiar with.
Can anyone share how I can read this unicode content into a scipy image (ndarray)?
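One possible direction, offered as a rough sketch rather than a confirmed solution: the u'\ufffd' characters at the start of the content are what UTF-8 decoding produces when it meets bytes that are not valid text, which suggests that reading the files as text has already discarded part of the image data. The snippet below reproduces that effect on the first bytes of a JPEG header using only the standard library. The helper to_ndarray is a hypothetical name, and it assumes that numpy and Pillow are installed on the workers and that the files are read with sc.binaryFiles, which yields raw bytes instead of decoded text.

```python
import io

# First bytes of a JPEG file: SOI marker, APP0 marker, segment length, "JFIF".
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Decoding binary data as UTF-8 text replaces each invalid byte with U+FFFD,
# so the original byte values cannot be recovered afterwards.
as_text = jpeg_header.decode("utf-8", errors="replace")
round_trip = as_text.encode("utf-8")


def to_ndarray(raw_bytes):
    """Decode the raw bytes of one image file into a NumPy array.

    Hypothetical helper: assumes numpy and Pillow are available.
    """
    # Imported inside the function so the imports happen on the Spark workers.
    import numpy as np
    from PIL import Image

    return np.asarray(Image.open(io.BytesIO(raw_bytes)))


# Assumed Spark usage (sc is an existing SparkContext):
# images = sc.binaryFiles("hdfs:///user/myuser/images/image/00*")
# arrays = images.mapValues(to_ndarray)  # RDD of (filename, ndarray)
```

Here as_text starts with the same four replacement characters as the example above, and round_trip no longer equals jpeg_header, so the text form cannot be turned back into a valid image.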