https://reine-ran.medium.com/5-mind-blowing-machine-learning-deep-tech-projects-b33479318986

From time to time I read ML/AI/DL papers just to keep up with what’s going on in the tech industry, and I thought it might be a good idea to collate some interesting ones and share them with you in an article, coupled with some key technical concepts. So here are a few research projects that I personally really like and hope you will too:

1. Toonify Yourself

From Justin Pinkney’s blog — Source from: link

Authors: Doron Adler and Justin Pinkney

Starting off with something light-hearted, this fun little project lets you upload a photo of yourself and transform it into a cartoon. Image processing and manipulation isn’t exactly new at this point, but it is still very interesting, especially since the authors provide a Google Colab notebook that lets you toonify yourself and follow along with the steps to recreate the results on your own.

Some key technical ideas/concepts:

This is a network blending/layer swapping project built on StyleGAN, using pre-trained models for transfer learning. Specifically, they use two models: a base model and a model fine-tuned from that base, and swap layers between the two to produce these results. The high-resolution layers are taken from the base model and the low-resolution layers from the fine-tuned model, so the blend keeps the fine-tuned model’s cartoon-like structure while retaining the base model’s realistic rendering. Next, a latent vector is recovered for the original image we want to “toonify”; this latent vector, when passed through the generator, reproduces an image very close to the original. Feeding that latent vector into the blended model produces the toonified version.
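To make the layer-swapping idea concrete, here is a minimal, hypothetical sketch in PyTorch. It assumes two generators with identical architectures and a made-up parameter-naming scheme (neither is taken from the project’s actual code; the Colab linked from the blog shows the real implementation):

```python
# Minimal sketch of StyleGAN layer swapping. `base_g` (real faces) and
# `tuned_g` (cartoons, fine-tuned from base_g) are assumed to share the
# same architecture. The parameter-name pattern below is hypothetical;
# adapt it to whichever StyleGAN implementation you use.
import re
import copy

def blend_models(base_g, tuned_g, swap_resolution=32):
    """Take layers at or below `swap_resolution` from the fine-tuned model
    and keep layers above it from the base model."""
    blended = copy.deepcopy(base_g)
    blended_state = blended.state_dict()
    tuned_state = tuned_g.state_dict()

    for name in blended_state:
        # Hypothetical naming scheme: e.g. "synthesis.b32.conv1.weight",
        # where "b32" marks the resolution of the block the layer belongs to.
        match = re.search(r"\.b(\d+)\.", name)
        if match and int(match.group(1)) <= swap_resolution:
            # Low-resolution (structure) layers come from the cartoon model.
            blended_state[name] = tuned_state[name].clone()

    blended.load_state_dict(blended_state)
    return blended

# Usage: project the input photo to a latent vector w (e.g. with a StyleGAN
# projector), then generate with the blended model:
# toon_image = blend_models(base_g, tuned_g)(w)
```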

Read more about it here.

2. Lamphone

From Ben-Gurion University of the Negev & Weizmann Institute of Science — Source from: link

Authors: Nassi, Ben and Pirutin, Yaron and Shamir, Adi and Elovici, Yuval and Zadov, Boris

I find this one slightly creepy. Lamphone is a novel side-channel attack that lets eavesdroppers recover speech and non-speech sounds from optical measurements of a light bulb’s vibrations, captured with an electro-optical sensor.

Some key technical ideas/concepts:

The setup involves a telescope with an electro-optical sensor mounted on it, aimed at a hanging bulb in the victim’s room. Sound creates fluctuations in air pressure around the bulb, which make it vibrate slightly, and those vibrations modulate the light the sensor measures. In their experiment, they successfully recovered a song played in the room and a snippet of a Trump speech recording from a distance of 25 meters using a $400 electro-optical sensor, and they note the range could be extended with higher-grade equipment.
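The recovery step is about turning the sensor’s optical measurements back into intelligible audio. As a very rough simplification of that idea (not the authors’ actual algorithm, which involves additional processing), one could band-pass the raw sensor trace to a speech band and normalize it:

```python
# Simplified sketch: turning an electro-optical sensor trace into audio.
# This is NOT the Lamphone algorithm itself, just the basic idea of
# filtering the optical signal to a speech band and rescaling it.
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.io import wavfile

def optical_to_audio(samples, fs=2000, low_hz=80, high_hz=400):
    """`samples`: raw voltage readings from the electro-optical sensor,
    sampled at `fs` Hz. Returns an audio signal normalized to [-1, 1]."""
    samples = samples - np.mean(samples)                   # remove DC offset
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, samples)                   # keep the speech band
    return filtered / (np.max(np.abs(filtered)) + 1e-9)    # normalize

# Example with synthetic data standing in for real sensor readings:
fs = 2000
t = np.arange(0, 2.0, 1 / fs)
fake_trace = 0.001 * np.sin(2 * np.pi * 220 * t) + 0.01 * np.random.randn(t.size)
wavfile.write("recovered.wav", fs, (optical_to_audio(fake_trace, fs) * 32767).astype(np.int16))
```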

Read more about it here.

3. GANPaint Studio

By researchers from MIT CSAIL, IBM Research, and the MIT-IBM Watson AI Lab — Source from: link

Authors: David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, Antonio Torralba

This was the project I got most excited about while collating this list. It is basically a tool that lets you “blend” an image object into a highlighted portion of the original image (you decide where to highlight as well). I have always been excited about the idea of letting people without formal art training easily create new images and artworks. I find creating new artworks to be a really enjoyable creative process, and hopefully, with more projects like these, the machine can play an assisting role that encourages more people to exercise their creative muscles. Even though this tool is not meant for creating art, I don’t think it is far-fetched to imagine a variation of it that WILL assist users in creating art.

Some key technical ideas/concepts:

The interesting thing about this project is that it synthesizes new content that follows both the user’s intention and natural image statistics. Their image manipulation pipeline is a three-step process: computing the latent vector of the original image, applying a semantic vector-space operation in the latent space, and finally regenerating the image from the modified latent vector. Here’s what they wrote in their original paper:

Given a natural photograph as input, we first re-render the image using an image generator. More concretely, to precisely reconstruct the input image, our method not only optimizes the latent representation but also adapts the generator. The user then manipulates the photo using our interactive interface, such as by adding or removing specific objects or changing their appearances. Our method updates the latent representation according to each edit and renders the final result given the modified representation. Our results look both realistic and visually similar to the input natural photograph.
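As a minimal, hypothetical illustration of that three-step loop (invert, edit in latent space, regenerate), with placeholder components rather than the authors’ released code:

```python
# Hypothetical sketch of the edit loop: (1) invert the photo to a latent
# code, (2) apply a semantic edit in latent space, (3) regenerate.
# `encoder`, `generator`, and `tree_direction` are placeholders, not the
# authors' released components (their method also adapts the generator
# per image so the photo is reconstructed precisely).
import torch

@torch.no_grad()
def edit_photo(image, encoder, generator, tree_direction, strength=1.5):
    z = encoder(image)                        # step 1: latent code, shape (1, latent_dim)
    z_edited = z + strength * tree_direction  # step 2: semantic vector-space edit
    return generator(z_edited)                # step 3: regenerate the edited image
```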

Read more about it or try out the demo here.

4. Jukebox

Image source from: link

Authors: Prafulla Dhariwal, Heewoo Jun, Christine McLeavey Payne (EQUAL CONTRIBUTORS), Jong Wook Kim, Alec Radford, Ilya Sutskever

In a nutshell, this project uses neural networks to generate music. Given a genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch.

From OpenAI — Source from: link

Automated music generation is not exactly new technology; earlier approaches generated music symbolically. However, those generators often cannot capture essential musical elements like human voices, subtle timbres, dynamics, and expressiveness. I spent some time listening to tracks on their “Jukebox Sample Explorer” page and personally felt it is still not at the level where listeners cannot tell a real track from an auto-generated one. Still, it is definitely exciting to see where projects like these will take us and what they mean for the music industry in the near future.

Some key technical ideas/concepts:

Their approach is a two-step process: the first step compresses music into discrete codes, and the second step generates new codes using transformers. They have really nice diagrams explaining this process, which I’ll include here:

Figure A — Step 1: Compression (Source from link)
Figure B — Step 2: Generation (Source from link)

How the compression part works: they use a modified version of the Vector Quantised-Variational AutoEncoder (VQ-VAE-2), a generative model for discrete representation learning. With reference to Figure A above, the 44kHz raw audio is compressed 8x, 32x, and 128x, with a codebook size of 2048 at each level. If you go to their site and click the sound icons to hear what each reconstructed audio sounds like, the right-most one sounds the noisiest since it is compressed 128x and only the most essential features (e.g. pitch, timbre, and volume) are retained.
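As a bare-bones illustration of the vector quantization step at the heart of VQ-VAE (tensor shapes here are illustrative; only the 2048-entry codebook size matches the levels mentioned above):

```python
# Bare-bones sketch of vector quantization: each encoder output vector is
# replaced by (the index of) its nearest codebook entry.
import torch

def quantize(latents, codebook):
    """latents: (T, D) encoder outputs; codebook: (K, D) learned entries.
    Returns discrete codes (T,) and the quantized vectors (T, D)."""
    distances = torch.cdist(latents, codebook)   # (T, K) pairwise distances
    codes = distances.argmin(dim=1)              # nearest entry per timestep
    return codes, codebook[codes]                # look the entries back up

codebook = torch.randn(2048, 64)   # 2048 entries, as in Jukebox's codebooks
latents = torch.randn(344, 64)     # roughly 1 second of 44kHz audio after 128x compression
codes, quantized = quantize(latents, codebook)
```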

How the generation part works: as mentioned in the VQ-VAE and VQ-VAE-2 papers, a powerful autoregressive decoder is used (but in Jukebox, separate decoders are used, and the input is reconstructed independently from the codes of each level to maximize the use of the upper levels). This generative phase is essentially about (from their official site):

(training) the prior models whose goal is to learn the distribution of music codes encoded by VQ-VAE and to generate music in this compressed discrete space. […] The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, significantly improving the audio quality. […] Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.
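A highly simplified sketch of that generate-upsample-decode flow, with placeholder objects standing in for the trained priors and the VQ-VAE decoder (none of this is OpenAI’s released code), might look like this:

```python
# Placeholder sketch of Jukebox's hierarchical generation:
# sample coarse codes, upsample them level by level, then decode to audio.
def generate_song(top_prior, upsamplers, vqvae_decoder, genre, artist, lyrics, length):
    # 1. Sample coarse codes from the top-level prior, conditioned on
    #    genre, artist, and lyrics; these capture long-range structure.
    codes = top_prior.sample(length, genre=genre, artist=artist, lyrics=lyrics)

    # 2. Upsample through the middle and bottom priors, which add local
    #    musical structure and improve audio quality.
    for upsampler in upsamplers:               # ordered: middle level, then bottom
        codes = upsampler.sample(conditioned_on=codes)

    # 3. Decode the finest-level codes back to raw audio with the VQ-VAE decoder.
    return vqvae_decoder(codes)
```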

Read more about it or try out the demo here.

5. Engaging Image Captioning via Personality

From Facebook AI Research — Source from: link

Authors: Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason Weston

I came across this research paper while browsing the research projects listed under Facebook AI’s conversational AI category. Traditionally, image captioning is very factual and emotionless (e.g. labels like “This is a cat” or “Boy playing football with friends”). In this project, they aim to inject emotions or personalities into the captions to make them seem more human and engaging. Citing from their paper, “the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits”.

Hopefully, this project will be used for good; the idea of machines/bots communicating like humans seems a little unsettling to me… Nonetheless, it is still a pretty interesting project.

Some key technical ideas/concepts:

Their retrieval architecture — source from: link

From the paper:

For image representations, we employ the work of [32] (D. Mahajan, R. B. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. CoRR, abs/1805.00932, 2018.) that uses a ResNeXt architecture trained on 3.5 billion social media images which we apply to both. For text, we use a Transformer sentence representation following [36] (P.-E. Mazaré, S. Humeau, M. Raison, and A. Bordes. Training Millions of Personalized Dialogue Agents. ArXiv e-prints, Sept. 2018.) trained on 1.7 billion dialogue examples. Our generative model gives a new state-of-the-art on COCO caption generation, and our retrieval architecture, TransResNet, yields the highest known R@1 score on the Flickr30k dataset.
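To make the retrieval setup concrete, here is a rough, hypothetical sketch of a TransResNet-style scorer that combines the image and personality representations and ranks candidate captions by dot product (all modules below are illustrative stand-ins, not the released model):

```python
# Hypothetical retrieval scorer: project pretrained image features, add a
# learned personality embedding, and rank candidate caption embeddings by
# dot product with the combined context vector.
import torch
import torch.nn as nn

class RetrievalScorer(nn.Module):
    def __init__(self, image_dim=2048, text_dim=512, num_personalities=215):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, text_dim)      # project ResNeXt-style features
        self.personality_emb = nn.Embedding(num_personalities, text_dim)

    def forward(self, image_features, personality_id, caption_embeddings):
        """image_features: (image_dim,) from a pretrained image encoder;
        personality_id: scalar LongTensor; caption_embeddings: (N, text_dim)."""
        context = self.image_proj(image_features) + self.personality_emb(personality_id)
        scores = caption_embeddings @ context                 # (N,) dot-product scores
        return scores.argmax()                                # index of the best caption
```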

Read more about it here.

Bonus: three more interesting projects. I’ll just include links for now (because researching and writing this article took up way too much time, rip), but comment if you want technical explanations or summaries:

6. AI that clones your voice after listening for 5 seconds: link

7. AI that creates real scenes from your photos: link

8. Restoring old images: link

Concluding remarks

When I was reading through these deep-tech research papers, I started to worry about how easily such techniques could be misused, with detrimental consequences for modern society as a whole. Even though some of them may seem innocuous, the gap between an innocent toy project and a powerful manipulative tool is extremely narrow. I sincerely hope frameworks or standards emerge soon, so that we end up with a controlled assistant rather than unleashing a monster.

That’s all I have to share for now. If I got any of the concepts or ideas wrong, I sincerely apologize; if you spot any mistakes, please feel free to reach out to me so I can correct them as soon as possible.

Thank you for reading this article.
