The Next Big Thing Dude


PROJECT DESCRIPTION


Our project is called Dude. The purpose is to develop a device, placed in common household rooms, that listens to human voice non-stop and responds to commands, much like the Echo developed by Amazon. However, each Dude should also be able to “talk to” other Dudes remotely, so that they can share information such as room temperature, humidity, motion, or even the commands each Dude has heard from humans.

Dude is composed of multiple sensors, a speaker (and probably a projector), a microphone, and a central processing unit on the hardware side. We will also need to write software for functionality such as voice recognition and data sharing.


CONCEPT


Our project Dude conveniently allows users to give voice commands at home or in the office. The project integrates sensors such as a microphone, a camera, and temperature and brightness sensors, and uses the Google Speech APIs for speech recognition and synthesis.


MOTIVATION


Today, a personal assistant application or device with a speech recognition engine requires the user to carry it around in order to receive continuous personal service, thereby hampering his/her independence. With our distributed system, a user can receive personal services at any location in his/her house.



COMPETITIVE ANALYSIS



Our idea is similar to Amazon Echo. Amazon Echo is designed around the user’s voice: the user can ask for information, music, news, weather, and more. Echo begins working as soon as it detects the wake word “Alexa” or “Amazon”. When Echo detects the wake word, it lights up and streams audio to the cloud, where Amazon Web Services recognizes and responds to the user’s request. Amazon Echo can also play music, manage to-do lists, and set alarms.

Although our project’s voice recognition cannot be as accurate as Amazon Echo’s, which is backed by powerful web servers, we do have our edge. Unlike Amazon Echo, which relies mainly on the network, our product is designed around a variety of customized sensors and actuators and can do some computation on chip. It can detect room temperature and humidity and capture human motion. Moreover, it has another way to display the information users need: projecting an image onto the ceiling or a wall.



REQUIREMENTS



Functional requirement

Dude should detect and listen to human voice in a room non-stop. It should use voice recognition to understand commands, and sense its environment using various built-in sensors, including temperature and brightness. It should recognize different users using data from the microphone and camera. Multiple Dudes should be able to communicate sensor data over WiFi.

Timing requirement

Dude should respond to users’ commands quickly when network latency is negligible (within 3 seconds). Facial recognition may take longer, about 10 seconds.

Reliability requirement

Dude should continue to provide service even when it does not recognize the current user or command. Users should not be able to issue queries that could execute kernel commands.

Security requirement

Unrecognized users should not be able to retrieve information about others through Dude.



COMPONENTS AND TECHNICAL SPECIFICATION



Hardware Components

We use a microphone to collect sound input from a user and transfer the signal to our Raspberry Pi. If the input sentence contains the activation word "Dude", the Raspberry Pi sends the data to the server for speaker recognition. Otherwise, it processes the command directly.
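As a rough sketch, the activation-word check on the transcribed sentence might look like the following (the function name and the normalization step are our own illustration, not part of the final design):

```python
def contains_wake_word(transcript, wake_word="dude"):
    """Return True if the activation word appears in the transcript.

    The comparison is case-insensitive and ignores punctuation, so
    "Hey Dude, what's the weather?" still activates the device.
    """
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in transcript)
    return wake_word in cleaned.split()
```

Splitting into whole words avoids false activations on words that merely contain "dude" as a substring.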

We use a temperature sensor to capture the relative humidity and temperature of the room. It uses a digital two-wire interface and offers high precision and excellent long-term stability. The sensor collects data continuously and sends it to the central chip. When a user queries the temperature, the Raspberry Pi reports it back to the user.

We use a camera to capture ten pictures of a user's face. The Raspberry Pi then runs facial recognition locally.

We use a brightness sensor to determine whether it is dark or bright at home during the night. From this information we can further infer whether anybody is at home.

We use a Raspberry Pi as the central chip for our project. It is integrated with the sensors and actuators.

The Raspberry Pi gathers the results of the user’s query and converts them into sound signals using cloud computing. Those sound signals are then transmitted by the speaker.

We use a speaker as the actuator to guide the user through making commands and to give feedback.

After the user makes a spoken query, the central chip sends the recorded sound to the Google voice server and waits for the response. Once the audio has been parsed into a string, the chip either processes the command locally (e.g. room temperature and humidity) or searches for the keywords on the internet. When the result is returned, it is passed to the Google voice server again to be converted back into sound. Finally, the result is played through the speaker.
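The local-versus-internet decision above can be sketched as a small keyword router. The keyword set and return values here are illustrative assumptions, not the exact ones used on the device:

```python
# Keywords that can be answered from the sensors attached to the Pi;
# this set is an assumption for illustration.
LOCAL_KEYWORDS = {"temperature", "humidity", "brightness"}

def route_command(text):
    """Decide whether a transcribed command is handled on-board
    or forwarded as an internet keyword search."""
    words = set(text.lower().split())
    if words & LOCAL_KEYWORDS:
        return "local"   # read the answer from on-board sensor data
    return "web"         # send the keywords out to a web search
```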

Software Components

  • Google Speech API for speech recognition and speech synthesis.
  • NumPy and SciPy for voice analysis and speaker recognition.
  • OpenCV for facial recognition.


  • Protocols

    A user activates Dude's recording by saying "Dude" and waiting for a response. If it is the user's first time using the device, Dude says "I am Dude, who are you?" The user can then say "I am ..." or "My name is ..." to set up a profile with Dude, after which the user can give commands. On first use, the user can also activate facial recognition by saying "recognize." Users can set or cancel an alarm with "Set alarm at ..." or "Cancel alarm at ...". Alarms are propagated across the other devices.
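One way to map those spoken phrases onto actions is a small pattern table; the regular expressions and action names below are our own sketch of the protocol, assuming the sentence has already been transcribed to text:

```python
import re

# Sketch of the spoken protocol; action names are illustrative.
PATTERNS = [
    (re.compile(r"^(?:i am|my name is)\s+(.+)$", re.I), "register"),
    (re.compile(r"^set alarm at\s+(.+)$", re.I), "set_alarm"),
    (re.compile(r"^cancel alarm at\s+(.+)$", re.I), "cancel_alarm"),
    (re.compile(r"^recognize$", re.I), "recognize"),
]

def parse_command(text):
    """Map a transcribed sentence to (action, argument) or (None, None)."""
    text = text.strip().rstrip(".")
    for pattern, action in PATTERNS:
        m = pattern.match(text)
        if m:
            arg = m.group(1) if m.groups() else None
            return action, arg
    return None, None
```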


    Components Annotation



    USE CASE DEMO



    Temperature Demo

    The temperature sensor breakout board we ended up using is the MCP9808 (http://www.adafruit.com/products/1782). The sensor has a range of -40°C to 125°C and an accuracy of 0.25°C. It uses a simple I2C interface, so it is compatible with our Raspberry Pi 2. The breakout board has four pins to connect: VDD (the 3V/5V power pin), GND (ground), SCL (the I2C clock pin), and SDA (the I2C data pin). Below is a photo of the soldered breakout board.
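Decoding the MCP9808's ambient-temperature register is pure bit manipulation, so it can be sketched independently of the I2C wiring. The following follows the register format in the MCP9808 datasheet:

```python
def mcp9808_decode(upper, lower):
    """Convert the two raw bytes of the MCP9808 ambient-temperature
    register (0x05) into degrees Celsius.

    The upper byte carries three alert-flag bits that must be masked
    off; the remaining 13 bits are a signed fixed-point value with a
    resolution of 1/16 °C.
    """
    upper &= 0x1F                  # clear the alert flag bits
    if upper & 0x10:               # sign bit set: temperature < 0 °C
        upper &= 0x0F
        return (upper * 16 + lower / 16.0) - 256.0
    return upper * 16 + lower / 16.0
```

On the Pi, the two raw bytes could be fetched with an I2C library such as smbus2, e.g. `SMBus(1).read_i2c_block_data(0x18, 0x05, 2)`, 0x18 being the MCP9808's default address.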




    Speaker Recognition Demo

    1. Updating sound files: when a user says his/her name after Dude receives the wake word “dude”, we create a new profile in our database and store the user’s name and the corresponding sound file (the recording of “dude”) in it.
    2. How speaker recognition works: every time a user says “dude”, we keep the sound file temporarily and compare it to the sound files in our database. We compute the Fourier transform and cross-correlate it with the other sound files in the database. We find the sound file with the largest correlation, and if that cross-correlation is greater than a threshold, we assume there is a match.
    3. Implementation: we use the NumPy and SciPy libraries in Python to do the fast Fourier transform and cross-correlation.
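The matching step described above can be sketched with NumPy alone (FFT-based cross-correlation, peak normalised by signal energy so that a signal matched against itself scores 1.0). The function and profile format are our own illustration:

```python
import numpy as np

def best_match(sample, profiles):
    """Return (name, score) of the stored profile whose recording
    correlates most strongly with the incoming "dude" sample."""
    def norm_corr(a, b):
        n = len(a) + len(b) - 1            # length of the linear correlation
        fa = np.fft.rfft(a, n)
        fb = np.fft.rfft(b, n)
        corr = np.fft.irfft(fa * np.conj(fb), n)
        denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
        return np.max(corr) / denom if denom > 0 else 0.0

    scores = {name: norm_corr(sample, rec) for name, rec in profiles.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]
```

A threshold on the returned score (as mentioned in step 2) then decides whether the best match is accepted or the speaker is treated as unknown.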




    API Demo

    1. Google Translate
      • We upload a string of English sentence(s) and play the output audio file
    2. Google Speech
      • We upload a recorded audio FLAC file and write the output to a text file
    3. Python Weather
      • Get current weather condition from weather.com
    4. Wikipedia
      • Search Wikipedia with keywords and get article summaries
    5. Datetime
      • Get the current date and time






    Final Demo Video

    Other Functionalities


    Sound Recording

    We have two types of sound recording: fixed-length and interrupt-based.

    1. Fixed-length: the microphone records for 3 seconds and uploads the recorded data to the Google Speech API to see if the keyword “dude” is inside. If so, the microphone records for 5 seconds of instruction and uploads it to the Google Speech API again to convert the instruction to text. If not, the microphone records for another 3 seconds.
      • Pros: simple; easy to do speaker recognition because all audio recordings are of fixed length.
      • Cons: consumes a lot of web traffic even when no one is speaking; can’t talk between recording intervals.
    2. Interrupt-based: the microphone waits for the sound level to cross a certain threshold to start recording, and stops after a certain period of silence.
      • Pros: consumes no traffic when no one is speaking; no constraint on when to talk.
      • Cons: hard to do well because noise levels differ from place to place; also makes speaker recognition less accurate because the interval is no longer fixed.
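The interrupt-based scheme can be sketched as a small state machine over successive audio chunks. The threshold and silence-window values below are illustrative assumptions that would need tuning per room:

```python
def segment_speech(chunk_levels, threshold=500, silence_chunks=4):
    """Given per-chunk loudness values (e.g. RMS of each audio chunk),
    return (start, end) index pairs of the segments that would have
    been recorded: recording starts at a loud chunk and stops after
    `silence_chunks` consecutive quiet ones."""
    segments, start, quiet = [], None, 0
    for i, level in enumerate(chunk_levels):
        if start is None:
            if level >= threshold:
                start, quiet = i, 0          # loud chunk: start recording
        else:
            if level < threshold:
                quiet += 1
                if quiet >= silence_chunks:  # long enough silence: stop
                    segments.append((start, i - silence_chunks + 1))
                    start = None
            else:
                quiet = 0
    if start is not None:                    # still recording at end of stream
        segments.append((start, len(chunk_levels)))
    return segments
```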


    Server and database

    The server serves as a database and also handles heavy computation. When a Dude needs information that is shared, such as info about a specific user or a to-do list created earlier, it forwards the request to the server. The server gets the information from the database and may do additional computation before sending results back.

    The server code is written in Python with the Tornado API. We use MySQL for the database, with tables for Dudes, users, and users’ state. Table DUDE stores a Dude’s id, location, temperature, etc. Table USER stores a user’s name, gender, preferences, etc. Table STATE is keyed by user and stores information about users as name-value pairs.
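The three-table layout can be sketched with the standard library's sqlite3 module as a stand-in for the MySQL instance. The column lists and the composite (user, name) key on STATE are our assumptions about the schema, for illustration only:

```python
import sqlite3

# In-memory stand-in for the MySQL database on the real server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DUDE  (id INTEGER PRIMARY KEY, location TEXT,
                        temperature REAL);
    CREATE TABLE USER  (name TEXT PRIMARY KEY, gender TEXT,
                        preference TEXT);
    CREATE TABLE STATE (user TEXT, name TEXT, value TEXT,
                        PRIMARY KEY (user, name));
""")
conn.execute("INSERT INTO DUDE VALUES (1, 'living room', 25.3)")
conn.execute("INSERT INTO USER VALUES ('alice', 'F', 'metric units')")
conn.execute("INSERT INTO STATE VALUES ('alice', 'todo', 'buy milk')")

# A shared-information lookup, e.g. a to-do list created earlier:
row = conn.execute(
    "SELECT value FROM STATE WHERE user = 'alice' AND name = 'todo'"
).fetchone()
```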

    One example of heavy computation that needs to be done on the server is speaker recognition. We tried to do it locally on a Dude, but the response time was too long. Instead, we transmit a small WAV file to the server, which sends back which user is speaking.

    Who's at Home

    A user can ask Dude who is using a Dude at a different location.

    Set Alarm

    A user can set an alarm on a different Dude.

    Facial Recognition

    A user can activate facial recognition by saying "recognize." The camera then takes ten pictures of the user and analyzes them using OpenCV. Dude can tell users apart by comparing their looks. We plan to use this functionality for security in the future.
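The "comparing their looks" step could be a nearest-neighbour match over per-user face feature vectors. This sketch assumes the captured pictures have already been reduced to feature vectors (for example by an OpenCV face recognizer); the function name and the distance threshold are hypothetical:

```python
import numpy as np

def identify(face_vector, known_users, max_distance=0.6):
    """Return the name of the stored user whose reference feature
    vector is nearest to `face_vector`, or None if nobody is close
    enough (an unrecognized user)."""
    best_name, best_dist = None, float("inf")
    for name, reference in known_users.items():
        dist = np.linalg.norm(face_vector - reference)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None
```

Returning None for faraway vectors matches the security requirement that unrecognized users get no access to others' information.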



    ARCHITECTURE





    INTERACTION




    MEET OUR AWESOME TEAM





    Jing Huang

    Jing is responsible for web speech API / software functionalities.


    Jingtao Xu

    Jingtao is responsible for hardware / software functionalities.


    An Wu

    An is responsible for web speech API / software functionalities.


    Hingon Miu

    Hingon is responsible for hardware / software functionalities.

    Created by Team Dude - Copyright 2015