The City of "Green Waves"

Developing an autonomous network of cooperating traffic lights and a coupled vehicle navigation system

Learn more about our AI-driven project for Frankfurt, which reorganizes traffic flow into green waves, reduces congestion and CO2 emissions, and saves up to 4 million euros per year through optimized traffic light switching.

Frankfurt am Main is one of Europe's largest economic centers, with one of its most heavily loaded road networks, carrying around 370,000 vehicles per day.

Our AI-based project proposes a radical improvement of the city's traffic situation by developing mutual coordination between a system of cooperating traffic lights and an associated vehicle navigation system. As a result, road traffic in the city is automatically organized into "green waves", avoiding congestion even during rush hours.

Innovative technologies

  • Deep Q-Network (DQN): Traffic lights will have adaptive control functions based on current traffic conditions and deep learning with Q-value prediction for actions.
  • Multi-agent systems: Each traffic light acts as an independent agent and coordinates with the other traffic lights to optimize the overall traffic flow (a minimal sketch follows this list).
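
Conceptually, the two techniques above combine into one small Q-network per intersection. The PyTorch sketch below only illustrates that idea; the class names, state layout, and number of signal phases are assumptions, not the project's actual implementation.

```python
# Illustrative sketch (PyTorch): one DQN agent per intersection.
# State layout, action set, and names are assumptions, not project code.
import torch
import torch.nn as nn

class TrafficLightAgent(nn.Module):
    """One intersection = one agent with its own Q-network."""

    def __init__(self, state_dim: int, n_phases: int):
        super().__init__()
        # Small fully connected Q-network: state -> one Q-value per signal phase.
        self.q_net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_phases),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.q_net(state)

    def act(self, state: torch.Tensor) -> int:
        # Greedy action: pick the phase with the highest predicted Q-value.
        with torch.no_grad():
            return int(self.forward(state).argmax().item())

# A city-wide system is then a collection of such agents, one per intersection,
# each acting on its local state (e.g. queue lengths on incoming lanes).
agents = {f"intersection_{i}": TrafficLightAgent(state_dim=16, n_phases=4) for i in range(3)}
action = agents["intersection_0"].act(torch.zeros(16))
```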

Project goals

  • Reducing traffic congestion: Optimizing the cooperative operation of traffic signal installations to cut back on congestion.
  • Environmental sustainability: Reducing CO2 emissions by cutting vehicle idling times.
  • Time and cost savings: Better accessibility and shorter travel times for residents and visitors.

Expected benefits of the project

Our project will help reduce the cost of traffic congestion, achieving savings of up to €2,500,000 per year by cutting lost time and improving the efficiency of traffic flow.

This will reduce CO2 emissions, improve environmental conditions in the city, and deliver an overall environmental benefit of up to €500,000 per year.

Optimized traffic light switching will shorten travel times for residents and visitors, improve accessibility, and lead to savings of up to €1,000,000 per year.

In total, implementing our project is expected to yield annual savings of €4,000,000, including lower congestion costs, environmental benefits, and time savings.

Investing in this project is a unique opportunity to contribute to the development of a sustainable and technologically advanced city. We invite partners and investors to join us in this important project and become part of a team that is changing urban traffic for the better.

Together we can make Frankfurt am Main an example for other cities worldwide of how innovative technologies can transform the traffic system and improve quality of life.

Ray Tang
Analysis on Deep Reinforcement Learning Agents for Traffic Light Control

[00:00:00.00] – Ray Tang
Other people in the last two weeks, and we will continue to work with… Yes, please.

[00:00:07.04] – Ray Tang
Good afternoon, everyone. Thank you so much for being here. My name is Ray, and this presentation is going to cover what I’ve done over the past two weeks on deep reinforcement learning for traffic control. First, we can see the agenda for the presentation. Is it on? Yeah.

[00:00:27.16] – Ivan Kisel
Just a second.

[00:00:32.22] – Ray Tang
Try again. No, it’s not.

[00:00:37.19] – Ivan Kisel
Can I give me the… Yeah.

[00:01:00.01] – Ray Tang
Okay, there we go. Okay, sorry. So here’s our agenda. Here’s the slideshow. First, we can go with the introduction and the research questions we want to answer, then an overview of how reinforcement learning works and the differences between normal reinforcement learning and deep reinforcement learning. Then we can show how neural networks are trained. Then we can go through the implementation and the steps we took to do the experiments. And then we can see some results and further research recommendations. So first, a short introduction. The work we’re doing today is based on a PhD thesis done in 2016 by Elisa Medical. Her thesis is basically using deep reinforcement learning to control traffic. The research questions we wanted to answer were: are we able to have a deep reinforcement learning model control traffic? Are we able to modify it? How are we able to improve it? So now we can go over a quick overview of reinforcement learning. Reinforcement learning is when we have an agent that takes a state from the environment and then performs some action to update the environment.
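
For reference, the interaction the speaker describes can be written in standard, generic reinforcement learning notation (this is textbook notation, not something taken from the project's slides):

```latex
% Generic agent-environment loop at time step t:
\begin{align*}
  &\text{observe state } s_t, \\
  &\text{choose action } a_t = \pi(s_t), \\
  &\text{environment returns reward } r_{t+1} \text{ and next state } s_{t+1}.
\end{align*}
```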

[00:02:39.29] – Ray Tang
And the way we represent this mathematically is using a Markov decision process, which is a four-element tuple made up of all the possible states, all the possible actions the model can take, the reward function, which outputs the reward when an action is taken and the state changes, and the transition function, which specifies the probability of ending up in a given state when an action is taken. So this tuple is how we represent decision making mathematically in our models. The way we maximize the reward is we have to tell the model what its goal is. The goal is to maximize the reward, right? So what we do is write what’s called a discounted return function, which is this right here. The discounted return is basically the sum of all the rewards, and gamma is a small constant called the discount factor. What gamma does is it decreases future rewards exponentially, so the further a reward lies in the future, the less it contributes. So now the agent knows what to do. It has its goal, but it needs a strategy; it doesn’t know what action to take.
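
A hedged reconstruction of the tuple and the discounted return the speaker points to on the slide, in standard notation (the slide's exact symbols are not visible in the transcript):

```latex
% Four-element tuple and discounted return, standard notation:
\begin{align*}
  \text{MDP} &= (S, A, R, T), \\
  R(s, a, s') &: \text{reward for taking action } a \text{ in state } s \text{ and landing in } s', \\
  T(s' \mid s, a) &: \text{probability of landing in } s' \text{ after taking } a \text{ in } s, \\
  G_t &= \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 < \gamma < 1 .
\end{align*}
```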

[00:03:47.29] – Ray Tang
So what we do is define what’s called a policy. The policy is basically a function that maps a state to an action, so now the agent knows what action to take in what state. And now that we have a policy, we’re able to write a Q-function. The Q-function estimates the expected reward of taking an action. So now we have a goal for the agent, the agent knows what action to take, and we can estimate the expected reward, so we should be all good to go for the agent. Here is a very broad overview of how normal reinforcement learning works. We have an environment. The environment gives the state and the reward to the agent, and the agent, using the policy and the Q-function, decides on an action and sends it to the environment. Now, the difference between normal reinforcement learning and deep reinforcement learning is that the agent is turned into a neural network. So you can see here the agent has turned into a neural network. Now, when we have neural networks, we have to train the neural network. The way we train it, when we’re training it, our goal is to minimize the error.
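
The loop on the slide (environment sends state and reward, agent uses its policy and Q-function to pick an action) could look roughly like the sketch below. `env`, its `reset`/`step` methods, and `q_function` are hypothetical placeholders used only to illustrate the flow, not the project's code.

```python
# Rough sketch of the reinforcement learning loop described above.
import random

def epsilon_greedy(q_values: list[float], epsilon: float = 0.1) -> int:
    """Policy: mostly take the action with the highest Q-value, sometimes explore."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_episode(env, q_function, n_steps: int = 1000) -> float:
    total_reward = 0.0
    state = env.reset()                    # environment provides the initial state
    for _ in range(n_steps):
        q_values = q_function(state)       # in deep RL this is a neural network
        action = epsilon_greedy(q_values)  # policy maps state -> action
        state, reward = env.step(action)   # environment returns next state + reward
        total_reward += reward
    return total_reward
```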

[00:04:53.07] – Ray Tang
We want to make the network operate with as little error as possible. The error is this: when we have a correct data set for training and the model produces its outputs, we subtract them to get the error. We want to minimize the error as much as possible, so we can ensure the model is as accurate as possible. Now, the way we do that is what’s called gradient descent, because we have a lot of different weights and we need to adjust each one of them. What gradient descent is, basically, is we take one of the weights, then we take the partial derivative, which is just the derivative along one dimension of an n-dimensional network, and we plot the error against the corresponding weight. In this way, you can see which weight value gets us the least error. Going further into this, what it does is it plots a point and then it takes the negative direction of the derivative. So if we’re here, we know to go this way, in the negative direction of the derivative, to get to the place where we get the minimum error.
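
As a toy illustration of the gradient descent step described here, the snippet below fits a two-weight linear model by repeatedly stepping each weight in the negative direction of its partial derivative. The data and learning rate are made up for the example.

```python
# Toy gradient descent: step each weight opposite its error derivative.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y_true = np.array([1.0, 3.0, 5.0, 7.0])   # targets from the "correct data set"

w, b = 0.0, 0.0                           # two weights of a tiny linear model
learning_rate = 0.05

for step in range(500):
    y_pred = w * x + b
    error = y_pred - y_true               # model output minus correct output
    loss = np.mean(error ** 2)            # the quantity we want to minimize
    # Partial derivatives of the loss with respect to each weight:
    grad_w = np.mean(2 * error * x)
    grad_b = np.mean(2 * error)
    # Step in the negative direction of the derivative:
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.2f}, b={b:.2f}, loss={loss:.4f}")  # approaches w=2, b=1
```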

[00:06:04.03] – Ray Tang
So that is how the model is trained using gradient descent and how the error is minimized. Another problem that we ran into is that sometimes the training data has patterns, and this is bad because the model will recognize the pattern, but that pattern may not be what appears in real life. This can lead to some problems in training. What we do to mitigate this issue is called experience replay, which is when we take training samples and put them into a memory. We save them for later, and then at some random time we take a random sample from memory and put it back into training. So we have random samples in the memory space, and then we randomly train on these. This randomness breaks up any possible patterns, which fixes the issue. So now we can go into how we implemented it, the steps we took to implement it, and what libraries we used. For our traffic simulation, we used a software called SUMO, which is a powerful traffic simulation software. And as mentioned before, we need a state for the agent.
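
A minimal sketch of the experience replay idea described here, assuming a simple deque-backed memory; the class and method names are illustrative, not the project's code.

```python
# Experience replay: store transitions, train on random mini-batches so
# consecutive, correlated samples don't dominate training.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)   # old transitions fall out automatically

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Random sampling breaks up patterns in the training data.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```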

[00:07:15.06] – Ray Tang
The way we get the state is we get data from SUMO, such as speed, car position, acceleration, average car waiting time, et cetera. Then we put it into multiple 2D matrices, and those matrices become the data for our state. There are, however, some difficulties with using SUMO, because SUMO is a collision-free model. For example, it’s hard to determine a reward for yellow lights and how to tell the agent to handle yellow lights, because they sit between red and green lights. To solve this issue, we simplify it a little bit by setting yellow lights to a default time of one second, and this way the agent is able to handle them. There are also some randomized routes that the cars take, and there is a probability that controls how often cars appear. So when we change that probability, the data will also change, and we can see that later. Now, this is just an example showing how the state is represented in a 2D matrix. This matrix represents the positions of the cars: zero means there’s no car and one means there’s a car.
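
A hedged sketch of how vehicle data from SUMO's TraCI Python API could be turned into one such 0/1 position matrix. The scenario file name, grid size, and cell size are assumptions; this is not the project's actual preprocessing code.

```python
# Sketch: discretize vehicle positions from SUMO into a binary occupancy grid.
import numpy as np
import traci  # SUMO's TraCI Python API

GRID, CELL = 32, 5.0  # 32x32 grid, 5 m per cell (assumed values)

traci.start(["sumo", "-c", "scenario.sumocfg"])  # hypothetical scenario file
try:
    traci.simulationStep()
    position_matrix = np.zeros((GRID, GRID))
    for veh_id in traci.vehicle.getIDList():
        x, y = traci.vehicle.getPosition(veh_id)
        col, row = int(x // CELL), int(y // CELL)
        if 0 <= row < GRID and 0 <= col < GRID:
            position_matrix[row, col] = 1.0   # 1 = a car occupies this cell
    # Additional matrices (speed, waiting time, ...) would be stacked on top
    # of this one to form the full state.
finally:
    traci.close()
```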

[00:08:26.26] – Ray Tang
So it’s many matrices just like this that represent our state. Another thing before we look at the results is checkpoints. Checkpoints are special files saved at a specific interval during training. What checkpoints allow us to do is basically go back in time during training and see how the model performs at one specific point. For example, in our experiment we did 1e6 training time steps, which is one million, and we saved a checkpoint every 10,000, so we have 100 checkpoints. So if we wanted to see how the model did at 20,000, we can go back and look at it. And all of that put together looks something like this. This is the simplified pipeline: first we set the parameters, such as the car probability, the checkpoint save interval, and so on, in a configuration file. Then we train the model and save the checkpoints. After we save the checkpoints, we can take a checkpoint file, or multiple checkpoint files, and test the model at that checkpoint. After we test it, we save our data in a CSV, and then we’re able to plot our data.
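
A rough sketch of the checkpoint part of this pipeline: save the model every fixed number of steps, then load a chosen checkpoint, test it, and append the metrics to a CSV for plotting. `train_one_step` and `evaluate` are hypothetical placeholders, and the file names are illustrative.

```python
# Sketch of checkpoint saving and checkpoint-based testing with CSV output.
import csv
import torch

TOTAL_STEPS = 1_000_000
SAVE_EVERY = 10_000          # -> 100 checkpoints, as in the talk

def train(agent, env):
    for step in range(1, TOTAL_STEPS + 1):
        train_one_step(agent, env)                     # placeholder training step
        if step % SAVE_EVERY == 0:
            torch.save(agent.state_dict(), f"checkpoint_{step}.pt")

def test_checkpoint(agent, env, step: int, csv_path: str = "results.csv"):
    agent.load_state_dict(torch.load(f"checkpoint_{step}.pt"))
    metrics = evaluate(agent, env)                     # placeholder: returns a dict
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["step", *metrics.keys()])
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow({"step": step, **metrics})
```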

[00:09:42.04] – Ray Tang
And then we should be able to see the plots of the data now. First, we trained three probabilities for 50,000 time steps, and then we set another training program to train other probabilities for a million time steps. First, we can look at the probabilities for 50,000 time steps. Here’s the graph for three probabilities, 0.1, 0.2, and 0.05, and this graph is the reward, so you can see how the different probabilities perform. Because 0.05 means fewer cars, you can see that the reward is greater, and that’s better. Here is the number of vehicles. This is also intuitive, because 0.2 is a high probability, so there are more vehicles, and then you can see 0.1 and 0.05. Here is the average speed for the probabilities, and here is the waiting time. All of the four previous graphs were at the 5e4 checkpoint, the 50,000 checkpoint. Now we can look at the 1 million checkpoint and see how much it has improved. Here we have the reward for the 1 million checkpoint, and it’s the same thing: you can see 0.05 does the best because it has the fewest cars, and 0.4 does the worst.

[00:11:02.23] – Ray Tang
And you can see that 0.4 actually does really badly at the end. We think this is catastrophic forgetting, but we need to analyze more. Catastrophic forgetting is something that happens when the Q-function is updated globally: basically, the model is trained to handle one situation, but then there are other types of situations it can’t handle, because it was trained on that one situation. We think that’s what’s happening here; that’s our theory. Here’s the number of vehicles. Again, it’s pretty intuitive: you can see 0.4 is much higher because that’s a higher probability. Here’s the average speed, and here is the waiting time. And because we have more checkpoints when we train for one million, we are able to see, as mentioned before, how it progresses through time, and here are some examples of that. The probability for all the following graphs is 0.2, just so we have a constant and are able to easily visualize, and these are plots of the reward. On the left, we have the reward for the 2.5e5 checkpoint, which is 25,000, I think.

[00:12:17.11] – Ray Tang
No, not 25,000. 250,000, I’m sorry. And this just graphs the reward over time. And then here is 5e5. You can see there is an improvement, because a greater reward means better: here you can see the average is around negative 40, but here it’s around negative 6, so it’s doing a lot better. But there’s also a problem where we sometimes experience what’s called overfitting. When you think of training, you would intuitively say that more training leads to a better model. That isn’t always the case. Overfitting is when a model is trained so much that, like I mentioned before, it handles one situation properly but does really badly in other situations. And this is a good example of what happened. These two graphs plot the waiting time, the average time a car spends waiting at a red light, for a probability equal to 0.05. On the left it’s 5e4, and on the right it’s one million. So this is 5e4, this is one million. And you can see that for 5e4, the waiting time is much lower than at one million.

[00:13:32.00] – Ray Tang
This one had much more training than this one, but this one does much better. We think this is because of overfitting. But again, we might need to analyze more data, because it’s just one case. Still, this might be a good example of overfitting. So generally, training more does help, but sometimes training more doesn’t help, which is overfitting. Now we can talk about some future research recommendations. The original paper mentioned the use of two neural network architectures, NIPS and Nature, and all the data you saw before was produced with NIPS. So maybe in the future we can try to get the data from Nature, or we can try to make our own model. Another future work idea: at the time of writing there’s no documentation on it, so writing some documentation would help future work. And that is it. I would like to thank Dr. for giving me this opportunity to come here and do all this work, and the amazing people here, Akil, Arsene, Woodrock, Robin, for helping me and explaining these different concepts. Also, my dad for his continuous support. Are there any questions?

[00:14:47.13] – Ivan Kisel
Thank you very much. Yes, please, questions. Very good. Can you show me the… can you please go to the results you have for the one-million-time-step checkpoint? Yes.

[00:15:07.27] – Ray Tang
Yeah.

[00:15:10.15] – Ivan Kisel
There’s one more dip in the reward. Can you explain what was the… where you talked about this the other time. Yes, this one. Can you explain… so you trained for one million time steps?

[00:15:33.15] – Ray Tang
Yes. So we train for one million. So this is the reward graph when we test it at 1e6, with the one million checkpoint file.

[00:15:45.19] – Ivan Kisel
So after training for one million time steps, you run it for testing?

[00:15:52.18] – Ray Tang
Yes, we run it for testing. So this is for a thousand… well, actually not for a thousand, it ended before that. So when we train it, it’s one million, but when we test it, we only test it for around 4,000.

[00:16:06.17] – Ivan Kisel
Can you go to the next slide? So the number of vehicles. You see this? Where the reward goes down for this group… yeah, it’s in the same place where the number of vehicles increases. You see the bump where it’s peaking?

[00:16:21.21] – Ray Tang
Yeah.

[00:16:22.20] – Ivan Kisel
And if you look at the previous slide, it’s the same place.

[00:16:26.07] – Ray Tang
Yes.

[00:16:26.22] – Ivan Kisel
Not the average speed? Oh, not average.

[00:16:28.03] – Ray Tang
Yes. Actually, what the reward is, is just the negative of the waiting time times some constant, because we found that this is a good way to evaluate the model, because waiting time is bad, right? No one likes being stuck in a traffic jam. So there’s an equation that we have where we have… sorry, let me go back to the waiting time. I forgot where it was, but you get the point: for the waiting time, we take the negative of the waiting time and then multiply it by some value, so they’re inversely correlated. So here you can see that the reward goes really far down. We can tell that at this point there was a lot of waiting time; there were a lot of traffic jams here. That’s why the reward goes down at the same time the waiting time goes up. Are there any more questions, please?
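
Written out, the reward the speaker describes is simply the following, with c an unspecified positive constant:

```latex
% Reward as described: negative waiting time W_t scaled by a constant c.
r_t = -\, c \cdot W_t, \qquad c > 0 .
```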

[00:17:23.02] – Ivan Kisel
For myself, could you explain again what waiting time means?

[00:17:27.27] – Ray Tang
So it’s-Waiting time? Yes.

[00:17:29.16] – Ivan Kisel
What’s the total?

[00:17:29.20] – Ray Tang
So waiting time is the average amount of time a car spends waiting in a traffic jam. It does vary with different training times and so on. But to calculate it, we take the total waiting time and divide it by the number of vehicles, and that gives us the average waiting time. That’s what we call it here. Okay.
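
In formula form, the definition given here is:

```latex
% Average waiting time: total waiting time over all N vehicles, divided by N,
% where w_i is the time vehicle i spends waiting.
\bar{W} = \frac{1}{N} \sum_{i=1}^{N} w_i .
```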

[00:17:52.21] – Ivan Kisel
On slide 33. Yes, on the left side. Can you explain this pattern from the middle onwards, where you have these peaks and it goes down?

[00:18:06.01] – Ray Tang
Yes, exactly. So this is actually a result of the training. In our code, in our simulation, when a vehicle experiences waiting time for too long, I think we have it set to one second, I’m not too sure, but when it waits for too long, it teleports away. That suddenly decreases the number of vehicles and causes some sudden variable changes, and that’s what you see here. So whenever you see this type of pattern in our data, it means the model was doing so badly that vehicles had to teleport away to prevent the simulation from stalling. It’s just to prevent the program from halting and stopping. When this pattern happens, it causes sudden changes in variables, which then cascade into other variables, and it ends up as this pattern. This pattern is not unique to this graph. Before, when we were looking at lower checkpoints, we also saw this pattern a lot, and that’s just the teleporting: it was doing so badly that vehicles had to be removed.

[00:19:09.19] – Ivan Kisel
But this is a setting from SUMO, right? So you can change it, or…?

[00:19:13.15] – Ray Tang
Yes, I think it’s something to do with SUMO, but we’re also able to change it. I haven’t changed it, but maybe… I think it’s one second. I’m sorry, not one second, I think 10 seconds. It’s something longer. But when a vehicle waits too long, it has to teleport away.
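
For context, the behavior discussed here corresponds to SUMO's time-to-teleport option, which can be overridden when launching the simulation; the config file name and the 300-second value below are illustrative, not the project's settings.

```python
# Launching SUMO via TraCI with an explicit time-to-teleport value.
import traci

traci.start([
    "sumo", "-c", "scenario.sumocfg",
    "--time-to-teleport", "300",   # seconds a vehicle may be stuck before teleporting
])
traci.simulationStep()
traci.close()
```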

[00:19:36.04] – Ivan Kisel
How is the reward calculated?

[00:19:38.22] – Ray Tang
The reward is the negative of the waiting time times some constant. Because we don’t like the waiting time. So the higher the waiting time is, the lower the reward.

[00:19:49.12] – Ivan Kisel
How do you define the waiting time?

[00:19:53.11] – Ray Tang
The average waiting time is just how long the vehicles are waiting in traffic, in a traffic jam. If they’ve been waiting too long, like I said before, we have them teleport away and delete them.

[00:20:06.17] – Ivan Kisel
So every car has a variable showing how much time it has been waiting for?

[00:20:11.18] – Ray Tang
Yes, it’s calculating how long it’s been waiting.

[00:20:14.29] – Ivan Kisel
So that’s why it’s being updated every time.

[00:20:17.19] – Ray Tang
This is graphing the reward, not the waiting time.

[00:20:24.08] – Ivan Kisel
The reward comes from the waiting time? Yes. So if a car is at the same place for one second and then two seconds, the reward is… Going down, yes. Okay. More questions? No. No more then. Okay, thank you very much.

[00:20:48.03] – Ray Tang
Very good. Also, I just wanted to mention setting up a code repository as a next step for future work.

[00:20:56.04] – Ivan Kisel
Thank you very much. Thank you.

 
